# local-llm-server

_An HTTP API to serve local LLM models._

The purpose of this server is to abstract your LLM backend from your frontend API. This lets you switch backends while providing a stable frontend for clients.

**Features:**

- Load balancing between a cluster of different VLLM backends.
- OpenAI-compatible API.
- Streaming support via websockets (and SSE for the OpenAI endpoint).
- Descriptive landing page.
- Logging and insights.
- Tokens and authentication with a priority system.
- Moderation system using OpenAI's moderation API.

## Install VLLM

The VLLM backend and local-llm-server don't need to be on the same machine.

1. Clone the repo: `git clone https://git.evulid.cc/cyberes/local-llm-server.git`
2. Create a venv.
3. Open `requirements.txt` in the cloned repo and find the line that defines VLLM (it looks something like `vllm==x.x.x`) and copy it.
4. Install that version of VLLM using `pip install vllm==x.x.x`
5. Download your model.
6. Create a user to run the VLLM server.
   ```shell
   sudo adduser vllm --system
   ```

   Also, make sure the user has access to the necessary files like the models and the venv.

7. Copy the systemd service file from `other/vllm/vllm.service` to `/etc/systemd/system/` and edit the paths to point to your install location. Then enable and start the service.

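Finding the pinned VLLM version in `requirements.txt` can also be scripted. The following is a minimal sketch, not part of the project: the helper names `find_vllm_pin` and `install_command` are hypothetical, and it assumes the requirement is written in the `vllm==x.x.x` form described above.

```python
import re
from typing import Optional

# Matches a pinned requirement such as "vllm==1.2.3" (placeholder version).
# Anything after the version (whitespace, environment markers, comments) is ignored.
_VLLM_PIN = re.compile(r"^\s*(vllm==[^\s;#]+)", re.IGNORECASE)


def find_vllm_pin(requirements_text: str) -> Optional[str]:
    """Return the first 'vllm==x.x.x' requirement in the text, or None."""
    for line in requirements_text.splitlines():
        match = _VLLM_PIN.match(line)
        if match:
            return match.group(1)
    return None


def install_command(requirements_text: str) -> str:
    """Build the pip command that installs the pinned VLLM version."""
    pin = find_vllm_pin(requirements_text)
    if pin is None:
        raise ValueError("no pinned vllm requirement found")
    return f"pip install {pin}"
```

From the root of the cloned repository, `install_command(open("requirements.txt").read())` returns the command to run inside your venv.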
## Install

1. Create a user to run the server:
   ```shell
   sudo adduser server --system
   ```

2. `mkdir /srv/server`

3. `git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/server/local-llm-server`

4. `sudo apt install redis`

5. `python3 -m venv venv`

6. `./venv/bin/pip install -r requirements.txt`

7. `chown -R server:nogroup /srv/server`

8. Create the logs location:
   ```shell
   sudo mkdir /var/log/localllm
   sudo chown -R server:adm /var/log/localllm/
   ```

9. Install nginx:
   ```shell
   sudo apt install nginx
   ```

10. An example nginx site is provided at `other/nginx-site.conf`. Copy this to `/etc/nginx/default`.
11. Copy the example config from `config/config.yml.sample` to `config/config.yml`. Modify the config (it's well commented).
12. Set up your MySQL server with a database and user according to what you configured in `config.yml`.
13. Install the two systemd services in `other/`, then enable and start them.

## Creating Tokens

You'll have to execute SQL queries to add tokens. phpMyAdmin makes this easy.

**Fields:**

- `token`: The authentication token. If it starts with `SYSTEM__`, it's reserved for internal usage.
- `type`: The token type. For your reference only; it does not appear to be used by the system (this still needs to be confirmed).
- `priority`: The priority of the token. Higher priority tokens are bumped up in the queue according to their priority.
- `simultaneous_ip`: How many requests from a single IP are allowed to be in the queue at once.
- `openai_moderation_enabled`: Enables moderation for this token. `1` means enabled, `0` means disabled.
- `uses`: How many times this token has been used. Set it to `0` and don't touch it.
- `max_uses`: How many times this token is allowed to be used. Set to `NULL` to disable the restriction and allow infinite uses.
- `expire`: When the token expires and will no longer be accepted. A Unix timestamp.
- `disabled`: Whether the token is disabled.

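As an illustration of how the fields above fit together, here is a minimal sketch that builds a parameterized `INSERT` for a new token. Several things in it are assumptions rather than facts about this project: the table name `token_auth` is hypothetical (check your schema), `disabled` is assumed to be a `0`/`1` flag like `openai_moderation_enabled`, and a `NULL` `expire` is assumed to mean "never expires". The `%s` placeholders follow the parameter style used by common MySQL drivers such as PyMySQL.

```python
import secrets
from typing import Optional, Tuple

# Hypothetical table name; check your database schema for the real one.
TOKEN_TABLE = "token_auth"

# Column names taken from the field list in this README.
COLUMNS = (
    "token", "type", "priority", "simultaneous_ip",
    "openai_moderation_enabled", "uses", "max_uses", "expire", "disabled",
)


def build_token_insert(
    token_type: str,
    priority: int,
    simultaneous_ip: int,
    *,
    moderation: bool = False,
    max_uses: Optional[int] = None,   # None becomes NULL: unlimited uses
    expire_at: Optional[int] = None,  # Unix timestamp; None is ASSUMED to mean no expiry
    token: Optional[str] = None,
) -> Tuple[str, tuple]:
    """Return (sql, params) for inserting one new, enabled token."""
    if token is None:
        token = secrets.token_urlsafe(32)  # random, URL-safe token string
    if token.startswith("SYSTEM__"):
        raise ValueError("tokens starting with SYSTEM__ are reserved for internal usage")
    # uses starts at 0; disabled is assumed to be a 0/1 flag and starts at 0.
    params = (token, token_type, priority, simultaneous_ip,
              int(moderation), 0, max_uses, expire_at, 0)
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = f"INSERT INTO {TOKEN_TABLE} ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    return sql, params
```

With a DB-API cursor from your MySQL driver, the result can be run as `cursor.execute(sql, params)` followed by a commit. Passing values as parameters instead of formatting them into the query string avoids quoting mistakes.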
## Updating VLLM

This project is pinned to a specific VLLM version because it depends on VLLM's sampling parameters. When updating, make sure the parameters in the `SamplingParams` object in [llm_server/llm/vllm/vllm_backend.py](https://git.evulid.cc/cyberes/local-llm-server/src/branch/master/llm_server/llm/vllm/vllm_backend.py) match up with those in VLLM's [vllm/sampling_params.py](https://github.com/vllm-project/vllm/blob/93348d9458af7517bb8c114611d438a1b4a2c3be/vllm/sampling_params.py).

Additionally, make sure our VLLM API server at [other/vllm/vllm_api_server.py](https://git.evulid.cc/cyberes/local-llm-server/src/branch/master/other/vllm/vllm_api_server.py) matches [vllm/entrypoints/api_server.py](https://github.com/vllm-project/vllm/blob/93348d9458af7517bb8c114611d438a1b4a2c3be/vllm/entrypoints/api_server.py).

Then, update the VLLM version in `requirements.txt`.

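The `SamplingParams` comparison described above can be partly automated. The sketch below is not part of the project; the helpers `constructor_params` and `diff_params` are hypothetical, and it assumes `SamplingParams` is a plain class whose parameters are declared on `__init__`, as in the linked revision. It only compares parameter names, so changed defaults or semantics still need a manual read of `sampling_params.py`.

```python
import inspect
from typing import Iterable, Set, Tuple


def constructor_params(cls: type) -> Set[str]:
    """Return the named parameters accepted by cls.__init__ (excluding self, *args, **kwargs)."""
    signature = inspect.signature(cls.__init__)
    return {
        name
        for name, param in signature.parameters.items()
        if name != "self"
        and param.kind not in (param.VAR_POSITIONAL, param.VAR_KEYWORD)
    }


def diff_params(cls: type, used: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Compare the parameter names we pass against what cls accepts.

    Returns (unknown, unused):
      unknown: names we pass that cls no longer accepts
      unused:  names cls accepts that we never pass
    """
    accepted = constructor_params(cls)
    used_names = set(used)
    return used_names - accepted, accepted - used_names
```

After installing the new VLLM version, you could call `diff_params(SamplingParams, [...])` with `from vllm import SamplingParams` and the list of parameter names passed in `vllm_backend.py`. A non-empty `unknown` set means the backend passes something the new version rejects.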
## To Do

- [ ] Support the Oobabooga Text Generation WebUI as a backend
- [ ] Make the moderation apply to the non-OpenAI endpoints as well
- [ ] Make sure stats work when starting from an empty database
- [ ] Make sure we're correctly canceling requests when the client cancels. The blocking endpoints can't detect when a client cancels generation.
- [ ] Add a test to verify the OpenAI endpoint works as expected
- [ ] Document the `Llm-Disable-Openai` header