This repository has been archived on 2024-10-27. You can view files and clone it, but cannot push or open issues or pull requests.

local-llm-server

An HTTP API to serve local LLM Models.

The purpose of this server is to abstract your LLM backend from your frontend API. This lets you switch backends while keeping a stable API for your frontend clients.

Features:

Load balancing between a cluster of different VLLM backends.
OpenAI-compatible API.
Streaming support via websockets (and SSE for the OpenAI endpoint).
Descriptive landing page.
Logging and insights.
Tokens and authentication with a priority system.
Moderation system using OpenAI's moderation API.
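
As an illustration of what balancing load across several backends can look like, here is a minimal sketch that picks the online backend with the fewest in-flight requests. It is not taken from this project's code: the `Backend` class, its fields, the `pick_backend` function, and the example URLs are all hypothetical, and the server's real selection logic may differ.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical record for one VLLM backend (names are illustrative only)."""
    url: str
    active_requests: int = 0
    online: bool = True


def pick_backend(backends: list[Backend]) -> Backend:
    """Return the online backend with the fewest in-flight requests."""
    candidates = [b for b in backends if b.online]
    if not candidates:
        raise RuntimeError("no online backends available")
    # min() returns the first minimal element, so ties go to the earliest entry.
    return min(candidates, key=lambda b: b.active_requests)


# Example cluster with made-up addresses; the third backend is offline.
cluster = [
    Backend("http://10.0.0.1:7000", active_requests=3),
    Backend("http://10.0.0.2:7000", active_requests=1),
    Backend("http://10.0.0.3:7000", active_requests=0, online=False),
]
chosen = pick_backend(cluster)
print(chosen.url)
```

Offline backends are skipped even when idle, so a client keeps getting answers as long as one backend stays up.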

Install VLLM

The VLLM backend and local-llm-server don't need to be on the same machine.

Create a venv.
Open requirements.txt and find the line that defines VLLM (it looks something like vllm==x.x.x) and copy it.
Install that version of VLLM using pip install vllm==x.x.x
Clone the repo: git clone https://git.evulid.cc/cyberes/local-llm-server.git
Download your model.
Create a user to run the VLLM server.
```
sudo adduser vllm --system
```
Also, make sure the user has access to the necessary files like the models and the venv.
Copy the systemd service file from other/vllm/vllm.service to /etc/systemd/system/ and edit the paths to point to your install location. Then enable and start the service.
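
If you would rather not look up the pinned VLLM version by hand, the lookup can be scripted. The sketch below assumes the pin is written as `vllm==<version>` on its own line in requirements.txt, as described above. The function name `find_vllm_pin` and the sample text are made up for illustration, and `0.0.0` is a placeholder, not the real pinned version.

```python
import re
from typing import Optional

# Matches a line that starts with "vllm==<version>" (optional whitespace allowed).
PIN_RE = re.compile(r"^\s*(vllm\s*==\s*[^\s#;]+)", re.IGNORECASE | re.MULTILINE)


def find_vllm_pin(requirements_text: str) -> Optional[str]:
    """Return the 'vllm==x.x.x' requirement, or None if there is no such pin."""
    match = PIN_RE.search(requirements_text)
    if match is None:
        return None
    # Strip any spaces around '==' so the result can be passed straight to pip.
    return re.sub(r"\s+", "", match.group(1))


# Placeholder contents; in practice read the real file, e.g.
# pathlib.Path("requirements.txt").read_text()
sample = "flask==2.3.2\nvllm==0.0.0  # pinned\nredis\n"
print(find_vllm_pin(sample))
```

The returned string can be passed directly to `pip install` inside the venv.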

Install

Create a user to run the server:
```
sudo adduser server --system
```
```
mkdir /srv/server
git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/server/local-llm-server
sudo apt install redis
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
chown -R server:nogroup /srv/server
```

Create the logs location:

```
sudo mkdir /var/log/localllm
sudo chown -R server:adm /var/log/localllm/
```

Install nginx:
```
sudo apt install nginx
```
An example nginx site is provided at other/nginx-site.conf. Copy this to /etc/nginx/default.
Copy the example config from config/config.yml.sample to config/config.yml. Modify the config (it's well commented).
Set up your MySQL server with a database and user according to what you configured in config.yml.
Install the two systemd services in other/ and activate them.

Creating Tokens

You'll have to execute SQL queries to add tokens. phpMyAdmin makes this easy.

Fields:

token: The authentication token. If it starts with SYSTEM__, it's reserved for internal usage.
type: The token type. For your reference only, not used by the system (need to confirm this, though).
priority: The priority of the token. Higher priority tokens are bumped up in the queue according to their priority.
simultaneous_ip: How many requests from an IP are allowed to be in the queue.
openai_moderation_enabled: Whether moderation is enabled for this token. 1 means enabled, 0 means disabled.
uses: How many times this token has been used. Set it to 0 and don't touch it.
max_uses: How many times this token is allowed to be used. Set to NULL to disable restriction and allow infinite uses.
expire: When the token expires and will no longer be allowed. A Unix timestamp.
disabled: Set the token to be disabled.
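
To make the interaction between uses, max_uses, expire, and disabled concrete, here is a small sketch of how a token row could be checked before a request is accepted. It is not the server's actual code. The `TokenRow` class and `token_is_usable` function are hypothetical, and two behaviors are assumptions that the field descriptions above do not state: a NULL expire is treated as "never expires", and a token counts as exhausted once uses reaches max_uses.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenRow:
    """Hypothetical in-memory copy of one row of the token table."""
    token: str
    priority: int
    uses: int = 0
    max_uses: Optional[int] = None  # NULL in SQL: unlimited uses
    expire: Optional[int] = None    # Unix timestamp; None assumed to mean "never"
    disabled: bool = False


def token_is_usable(row: TokenRow, now: Optional[float] = None) -> bool:
    """Return True if the token may still be used at time `now`."""
    now = time.time() if now is None else now
    if row.disabled:
        return False
    # Assumption: the token is exhausted once uses reaches max_uses.
    if row.max_uses is not None and row.uses >= row.max_uses:
        return False
    if row.expire is not None and now >= row.expire:
        return False
    return True


unlimited = TokenRow(token="example-token", priority=10)
exhausted = TokenRow(token="example-token-2", priority=10, uses=5, max_uses=5)
print(token_is_usable(unlimited), token_is_usable(exhausted))
```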

Updating VLLM

This project is pinned to a specific VLLM version because it depends on VLLM's sampling parameters. When updating, make sure the parameters in the SamplingParams object in llm_server/llm/vllm/vllm_backend.py match up with those in VLLM's vllm/sampling_params.py.

Additionally, make sure our VLLM API server at other/vllm/vllm_api_server.py matches vllm/entrypoints/api_server.py.

Then, update the VLLM version in requirements.txt.
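
Comparing parameter lists by eye is error-prone, so a generic helper can do part of the check. The sketch below uses Python's `inspect.signature` to list the parameter names that appear in one callable but not in another. The two functions are stand-ins with made-up parameters, not the project's or VLLM's real definitions. To use it for real you would pass the two actual objects, assuming both expose an introspectable signature, which may not hold for every VLLM version.

```python
import inspect
from typing import Callable


def parameter_diff(ours: Callable, upstream: Callable) -> tuple[set[str], set[str]]:
    """Return (names only in `ours`, names only in `upstream`)."""
    ours_names = set(inspect.signature(ours).parameters)
    upstream_names = set(inspect.signature(upstream).parameters)
    return ours_names - upstream_names, upstream_names - ours_names


# Stand-in callables with illustrative parameters (not the real definitions).
def our_sampling_params(temperature=1.0, top_p=1.0, max_tokens=16):
    ...


def upstream_sampling_params(temperature=1.0, top_p=1.0, top_k=-1, max_tokens=16):
    ...


only_ours, only_upstream = parameter_diff(our_sampling_params, upstream_sampling_params)
print("only in ours:", only_ours)
print("only in upstream:", only_upstream)
```

A non-empty result only flags names that differ. Changed defaults or semantics still need a manual read of vllm/sampling_params.py.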

To Do

Support the Oobabooga Text Generation WebUI as a backend
Make the moderation apply to the non-OpenAI endpoints as well
Make sure stats work when starting from an empty database
Make sure we're correctly canceling requests when the client cancels. The blocking endpoints can't detect when a client cancels generation.
Add a test to verify the OpenAI endpoint works as expected
Document the Llm-Disable-Openai header