Cyberes fd09c783d3 | ||
---|---|---|
config | ||
llm_server | ||
other | ||
templates | ||
.gitignore | ||
LICENSE | ||
README.md | ||
daemon.py | ||
requirements.txt | ||
server.py |
README.md
local-llm-server
An HTTP API to serve local LLM Models.
The purpose of this server is to abstract your LLM backend from your frontend API. This enables you to switch your backend while providing a stable frontend clients.
Features:
- Load balancing between a cluster of different VLLM backends.
- OpenAI-compatible API.
- Streaming support via websockets (and SSE for the OpenAI endpoint).
- Descriptive landing page.
- Logging and insights.
- Tokens and authentication with a priority system.
- Moderation system using OpenAI's moderation API.
Install VLLM
The VLLM backend and local-llm-server don't need to be on the same machine.
-
Create a venv.
-
Open
requirements.txt
and find the line that defines VLLM (it looks something likevllm==x.x.x
) and copy it. -
Install that version of VLLM using
pip install vllm==x.x.x
-
Clone the repo:
git clone https://git.evulid.cc/cyberes/local-llm-server.git
-
Download your model.
-
Create a user to run the VLLM server.
sudo adduser vllm --system
Also, make sure the user has access to the necessary files like the models and the venv.
-
Copy the systemd service file from
other/vllm/vllm.service
to/etc/systemd/system/
and edit the paths to point to your install location. Then activate the server.
Install
-
Create a user to run the server:
sudo adduser server --system
-
mkdir /srv/server
-
git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/server/local-llm-server
-
sudo apt install redis
-
python3 -m venv venv
-
./venv/bin/pip install -r requirements.txt
-
chown -R server:nogroup /srv/server
-
Create the logs location:
sudo mkdir /var/log/localllm sudo chown -R server:adm /var/log/localllm/
-
Install nginx:
sudo apt install nginx
-
An example nginx site is provided at
other/nginx-site.conf
. Copy this to/etc/nginx/default
. -
Copy the example config from
config/config.yml.sample
toconfig/config.yml
. Modify the config (it's well commented). -
Set up your MySQL server with a database and user according to what you configured in
config.yml
. -
Install the two systemd services in
other/
and activate them.
Creating Tokens
You'll have to execute SQL queries to add tokens. phpMyAdmin makes this easy.
Fields:
token
: The authentication token. If it starts withSYSTEM__
, it's reserved for internal usage.type
: The token type. For your reference only, not used by the system (need to confirm this, though).priority
: The priority of the token. Higher priority tokens are bumped up in the queue according to their priority.simultaneous_ip
: How many requests from an IP are allowed to be in the queue.openai_moderation_enabled
: enable moderation for this token.1
means enabled,0
is disabled.uses
: How many times this token has been used. Set it to0
and don't touch it.max_uses
: How many times this token is allowed to be used. Set toNULL
to disable restriction and allow infinite uses.expire
: When the token expires and will no longer be allowed. A Unix timestamp.disabled
: Set the token to be disabled.
Updating VLLM
This project is linked to a specific VLLM version due to a dependency on the parameters. When updating, make sure the parameters in the SamplingParams
object in llm_server/llm/vllm/vllm_backend.py match up with those in VLLM's vllm/sampling_params.py.
Additionally, make sure our VLLM API server at other/vllm/vllm_api_server.py matches vllm/entrypoints/api_server.py.
Then, update the VLLM version in requirements.txt
.
To Do
- Support the Oobabooga Text Generation WebUI as a backend
- Make the moderation apply to the non-OpenAI endpoints as well
- Make sure stats work when starting from an empty database
- Make sure we're correctly canceling requests when the client cancels. The blocking endpoints can't detect when a client cancels generation.
- Add test to verify the OpenAI endpoint works as expected