local-llm-server
An HTTP API to serve local LLM models.
ARCHIVED PROJECT: this project was created before any good solution existed for managing LLM endpoints and has since been superseded by many good options. LiteLLM is the best replacement. If you need an unauthenticated public connection for SillyTavern, check out cyberes/litellm-public.
The purpose of this server is to abstract your LLM backend from your frontend API. This lets you switch backends while presenting a stable API to your frontend clients.
Features:
- Load balancing across a cluster of different VLLM backends.
- OpenAI-compatible API (see the example request after this list).
- Streaming support via websockets (and SSE for the OpenAI endpoint).
- Descriptive landing page.
- Logging and insights.
- Token-based authentication with a priority system.
- Moderation system using OpenAI's moderation API.
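As a rough illustration of the OpenAI-compatible API, here is a minimal sketch of a request. The base URL, model name, and token are placeholders, and the `/v1/chat/completions` path plus the Bearer-token header are assumed from standard OpenAI compatibility rather than taken from this repo:

```bash
# Hypothetical example -- substitute your server's base URL, a token from your
# database, and the model name reported by your backend.
curl https://llm.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN_HERE" \
  -d '{
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": false
      }'
```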
Install VLLM
The VLLM backend and local-llm-server don't need to be on the same machine.
- Create a venv.
- Open `requirements.txt` and find the line that defines VLLM (it looks something like `vllm==x.x.x`) and copy it.
- Install that version of VLLM: `pip install vllm==x.x.x`
- Clone the repo: `git clone https://git.evulid.cc/cyberes/local-llm-server.git`
- Download your model.
- Create a user to run the VLLM server: `sudo adduser vllm --system`. Also, make sure the user has access to the necessary files, like the models and the venv.
- Copy the systemd service file from `other/vllm/vllm.service` to `/etc/systemd/system/` and edit the paths to point to your install location. Then activate the service (see the sketch after this list).
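A minimal sketch of that last step, assuming the unit keeps the name `vllm.service` and that the checkout path below is replaced with your own:

```bash
# Copy the unit file from the cloned repo (adjust the source path to your checkout),
# then reload systemd, enable the unit, and start it.
sudo cp /path/to/local-llm-server/other/vllm/vllm.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now vllm.service

# Confirm the backend came up.
sudo systemctl status vllm.service
```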
Install
- Create a user to run the server: `sudo adduser server --system`
- `mkdir /srv/server`
- `git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/server/local-llm-server`
- `sudo apt install redis`
- `python3 -m venv venv`
- `./venv/bin/pip install -r requirements.txt`
- `chown -R server:nogroup /srv/server`
- Create the logs location: `sudo mkdir /var/log/localllm` and `sudo chown -R server:adm /var/log/localllm/`
- Install nginx: `sudo apt install nginx`
- An example nginx site is provided at `other/nginx-site.conf`. Copy this to `/etc/nginx/default`.
- Copy the example config from `config/config.yml.sample` to `config/config.yml`. Modify the config (it's well commented).
- Set up your MySQL server with a database and user according to what you configured in `config.yml`.
- Install the two systemd services in `other/` and activate them (see the sketch after this list).
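A hedged sketch of the last few steps. The nginx and config paths come from the list above; the service unit filenames are not listed here, so they are left as placeholders -- use whatever units you find in `other/`:

```bash
# nginx site and example config (paths from the steps above).
sudo cp /srv/server/local-llm-server/other/nginx-site.conf /etc/nginx/default
sudo systemctl reload nginx
cp /srv/server/local-llm-server/config/config.yml.sample /srv/server/local-llm-server/config/config.yml

# Install and activate the two service units shipped in other/.
ls /srv/server/local-llm-server/other/
sudo cp /srv/server/local-llm-server/other/NAME-OF-UNIT.service /etc/systemd/system/   # repeat for both units
sudo systemctl daemon-reload
sudo systemctl enable --now NAME-OF-UNIT.service                                       # repeat for both units
```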
Creating Tokens
You'll have to execute SQL queries to add tokens; phpMyAdmin makes this easy. (An example `INSERT` is sketched after the field list below.)
Fields:
- `token`: The authentication token. If it starts with `SYSTEM__`, it's reserved for internal usage.
- `type`: The token type. For your reference only, not used by the system (need to confirm this, though).
- `priority`: The priority of the token. Higher-priority tokens are bumped up in the queue according to their priority.
- `simultaneous_ip`: How many requests from an IP are allowed to be in the queue.
- `openai_moderation_enabled`: Enable moderation for this token. `1` means enabled, `0` is disabled.
- `uses`: How many times this token has been used. Set it to `0` and don't touch it.
- `max_uses`: How many times this token is allowed to be used. Set to `NULL` to disable the restriction and allow infinite uses.
- `expire`: When the token expires and will no longer be allowed. A Unix timestamp.
- `disabled`: Set the token to be disabled.
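A minimal sketch of adding a token from the MySQL CLI instead of phpMyAdmin. The table name `token_auth` is a placeholder (check your database for the actual table this project uses); the column names come from the field list above, and `expire` is a Unix timestamp:

```bash
# Placeholder table name -- substitute the real token table from your database.
mysql -u your_db_user -p your_db_name -e "
  INSERT INTO token_auth
    (token, type, priority, simultaneous_ip, openai_moderation_enabled, uses, max_uses, expire, disabled)
  VALUES
    ('example-token-123', 'friend', 10, 2, 1, 0, NULL, 1735689600, 0);
"
```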
Updating VLLM
This project is pinned to a specific VLLM version because it depends on VLLM's sampling parameters. When updating, make sure the parameters in the `SamplingParams` object in `llm_server/llm/vllm/vllm_backend.py` match up with those in VLLM's `vllm/sampling_params.py`. Additionally, make sure our VLLM API server at `other/vllm/vllm_api_server.py` matches `vllm/entrypoints/api_server.py`. Then, update the VLLM version in `requirements.txt`.
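A hedged sketch of that update check, assuming you have the new VLLM source checked out somewhere (the path and version are placeholders):

```bash
# The API server copy is meant to track upstream, so diff it directly.
diff other/vllm/vllm_api_server.py /path/to/vllm/vllm/entrypoints/api_server.py

# For the sampling parameters, compare the SamplingParams(...) call in our backend
# against the fields accepted by the new release.
grep -n "SamplingParams" llm_server/llm/vllm/vllm_backend.py
less /path/to/vllm/vllm/sampling_params.py

# Once they match, bump the vllm==x.x.x pin in requirements.txt and reinstall.
./venv/bin/pip install -r requirements.txt
```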
To Do
- Support the Oobabooga Text Generation WebUI as a backend
- Make the moderation apply to the non-OpenAI endpoints as well
- Make sure stats work when starting from an empty database
- Make sure we're correctly canceling requests when the client cancels. The blocking endpoints can't detect when a client cancels generation.
- Add a test to verify the OpenAI endpoint works as expected
- Document the `Llm-Disable-Openai` header