local-llm-server/README.md

# local-llm-server

_An HTTP API to serve local LLM Models._

**ARCHIVED PROJECT:** this project was created before any good solution existed for managing LLM endpoints and has now been superseded by many good options. [LiteLLM](https://github.com/BerriAI/litellm) is the best replacement. If a need for an un-authenticated public model arises, check out [cyberes/litellm-public](https://git.evulid.cc/cyberes/litellm-public).

The purpose of this server is to abstract your LLM backend from your frontend API. This enables you to switch your backend while providing a stable frontend clients.


**Features:**

- Load balancing between a cluster of different VLLM backends.
- OpenAI-compatible API.
- Streaming support via websockets (and SSE for the OpenAI endpoint).
- Descriptive landing page.
- Logging and insights.
- Tokens and authentication with a priority system.
- Moderation system using OpenAI's moderation API.


## Install VLLM

The VLLM backend and local-llm-server don't need to be on the same machine.

1. Create a venv.
2. Open `requirements.txt` and find the line that defines VLLM (it looks something like `vllm==x.x.x`) and copy it.
3. Install that version of VLLM using `pip install vllm==x.x.x`
4. Clone the repo: `git clone https://git.evulid.cc/cyberes/local-llm-server.git`
5. Download your model.
6. Create a user to run the VLLM server.
   ```shell
   sudo adduser vllm --system
   ```

   Also, make sure the user has access to the necessary files like the models and the venv.

7. Copy the systemd service file from `other/vllm/vllm.service` to `/etc/systemd/system/` and edit the paths to point to your install location. Then activate the server.


## Install

1. Create a user to run the server:
    ```shell
    sudo adduser server --system
    ```

2. `mkdir /srv/server`

3. `git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/server/local-llm-server`

4. `sudo apt install redis`

5. `python3 -m venv venv`

6. `./venv/bin/pip install -r requirements.txt`

7. `chown -R server:nogroup /srv/server`

8. Create the logs location:
    ```shell
    sudo mkdir /var/log/localllm
    sudo chown -R server:adm /var/log/localllm/

9. Install nginx:
   ```shell
   sudo apt install nginx
   ```

10. An example nginx site is provided at `other/nginx-site.conf`. Copy this to `/etc/nginx/default`.
11. Copy the example config from `config/config.yml.sample` to `config/config.yml`. Modify the config (it's well commented).
12. Set up your MySQL server with a database and user according to what you configured in `config.yml`.
13. Install the two systemd services in `other/` and activate them.


## Creating Tokens

You'll have to execute SQL queries to add tokens. phpMyAdmin makes this easy.


**Fields:**

- `token`: The authentication token. If it starts with `SYSTEM__`, it's reserved for internal usage.
- `type`: The token type. For your reference only, not used by the system (need to confirm this, though).
- `priority`: The priority of the token. Higher priority tokens are bumped up in the queue according to their priority.
- `simultaneous_ip`: How many requests from an IP are allowed to be in the queue.
- `openai_moderation_enabled`: enable moderation for this token. `1` means enabled, `0` is disabled.
- `uses`: How many times this token has been used. Set it to `0` and don't touch it.
- `max_uses`: How many times this token is allowed to be used. Set to `NULL` to disable restriction and allow infinite uses.
- `expire`: When the token expires and will no longer be allowed. A Unix timestamp.
- `disabled`: Set the token to be disabled.


## Updating VLLM

This project is linked to a specific VLLM version due to a dependency on the parameters. When updating, make sure the parameters in the `SamplingParams` object in [llm_server/llm/vllm/vllm_backend.py](https://git.evulid.cc/cyberes/local-llm-server/src/branch/master/llm_server/llm/vllm/vllm_backend.py) match up with those in VLLM's [vllm/sampling_params.py](https://github.com/vllm-project/vllm/blob/93348d9458af7517bb8c114611d438a1b4a2c3be/vllm/sampling_params.py).

Additionally, make sure our VLLM API server at [other/vllm/vllm_api_server.py](https://git.evulid.cc/cyberes/local-llm-server/src/branch/master/other/vllm/vllm_api_server.py) matches [vllm/entrypoints/api_server.py](https://github.com/vllm-project/vllm/blob/93348d9458af7517bb8c114611d438a1b4a2c3be/vllm/entrypoints/api_server.py).

Then, update the VLLM version in `requirements.txt`.


## To Do

- [ ] Support the Oobabooga Text Generation WebUI as a backend
- [ ] Make the moderation apply to the non-OpenAI endpoints as well
- [ ] Make sure stats work when starting from an empty database
- [ ] Make sure we're correctly canceling requests when the client cancels. The blocking endpoints can't detect when a client cancels generation.
- [ ] Add test to verify the OpenAI endpoint works as expected
- [ ] Document the `Llm-Disable-Openai` header
Initial commit 2023-08-21 14:40:46 -06:00			`# local-llm-server`

restyle homepage, add config item to add content to the home page 2023-08-24 17:55:55 -06:00			`_An HTTP API to serve local LLM Models._`
use redis caching 2023-08-21 23:59:50 -06:00
archive project 2024-10-27 12:13:26 -06:00			`ARCHIVED PROJECT: this project was created before any good solution existed for managing LLM endpoints and has now been superseded by many good options. [LiteLLM](https://github.com/BerriAI/litellm) is the best replacement. If a need for an un-authenticated public model arises, check out [cyberes/litellm-public](https://git.evulid.cc/cyberes/litellm-public).`

minor changes, add admin token auth system, add route to get backend info 2023-09-24 15:54:35 -06:00			`The purpose of this server is to abstract your LLM backend from your frontend API. This enables you to switch your backend while providing a stable frontend clients.`
update readme 2023-08-23 23:48:46 -06:00

fix gunicorn logging 2023-12-21 14:24:50 -07:00
ready for public release 2024-03-18 12:42:44 -06:00			`Features:`
update readme 2023-08-23 23:48:46 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			`- Load balancing between a cluster of different VLLM backends.`
			`- OpenAI-compatible API.`
			`- Streaming support via websockets (and SSE for the OpenAI endpoint).`
			`- Descriptive landing page.`
			`- Logging and insights.`
			`- Tokens and authentication with a priority system.`
			`- Moderation system using OpenAI's moderation API.`
update readme 2023-08-23 23:48:46 -06:00


ready for public release 2024-03-18 12:42:44 -06:00			`## Install VLLM`
update readme 2023-08-23 23:48:46 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			`The VLLM backend and local-llm-server don't need to be on the same machine.`
update readme 2023-08-23 23:48:46 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			`1. Create a venv.`
			2. Open `requirements.txt` and find the line that defines VLLM (it looks something like `vllm==x.x.x`) and copy it.
			3. Install that version of VLLM using `pip install vllm==x.x.x`
			4. Clone the repo: `git clone https://git.evulid.cc/cyberes/local-llm-server.git`
			`5. Download your model.`
			`6. Create a user to run the VLLM server.`
			```shell
			`sudo adduser vllm --system`
			```
update readme 2023-08-23 23:48:46 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			`Also, make sure the user has access to the necessary files like the models and the venv.`
update readme 2023-08-23 23:48:46 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			7. Copy the systemd service file from `other/vllm/vllm.service` to `/etc/systemd/system/` and edit the paths to point to your install location. Then activate the server.
update readme 2023-08-23 23:48:46 -06:00


ready for public release 2024-03-18 12:42:44 -06:00			`## Install`
update readme 2023-08-23 23:48:46 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			`1. Create a user to run the server:`
			```shell
			`sudo adduser server --system`
			```
update readme 2023-08-23 23:48:46 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			2. `mkdir /srv/server`
update readme 2023-08-24 12:19:59 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			3. `git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/server/local-llm-server`
update home, update readme, calculate estimated wait based on database stats 2023-08-24 16:47:14 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			4. `sudo apt install redis`
minor changes, add admin token auth system, add route to get backend info 2023-09-24 15:54:35 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			5. `python3 -m venv venv`
update home, update readme, calculate estimated wait based on database stats 2023-08-24 16:47:14 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			6. `./venv/bin/pip install -r requirements.txt`
update readme 2023-08-24 12:19:59 -06:00
ready for public release 2024-03-18 12:42:44 -06:00			7. `chown -R server:nogroup /srv/server`

			`8. Create the logs location:`
			```shell
			`sudo mkdir /var/log/localllm`
			`sudo chown -R server:adm /var/log/localllm/`

			`9. Install nginx:`
			```shell
			`sudo apt install nginx`
			```

			10. An example nginx site is provided at `other/nginx-site.conf`. Copy this to `/etc/nginx/default`.
			11. Copy the example config from `config/config.yml.sample` to `config/config.yml`. Modify the config (it's well commented).
			12. Set up your MySQL server with a database and user according to what you configured in `config.yml`.
			13. Install the two systemd services in `other/` and activate them.



			`## Creating Tokens`

			`You'll have to execute SQL queries to add tokens. phpMyAdmin makes this easy.`



			`Fields:`

			- `token`: The authentication token. If it starts with `SYSTEM__`, it's reserved for internal usage.
			- `type`: The token type. For your reference only, not used by the system (need to confirm this, though).
			- `priority`: The priority of the token. Higher priority tokens are bumped up in the queue according to their priority.
			- `simultaneous_ip`: How many requests from an IP are allowed to be in the queue.
			- `openai_moderation_enabled`: enable moderation for this token. `1` means enabled, `0` is disabled.
			- `uses`: How many times this token has been used. Set it to `0` and don't touch it.
			- `max_uses`: How many times this token is allowed to be used. Set to `NULL` to disable restriction and allow infinite uses.
			- `expire`: When the token expires and will no longer be allowed. A Unix timestamp.
			- `disabled`: Set the token to be disabled.



			`## Updating VLLM`

			This project is linked to a specific VLLM version due to a dependency on the parameters. When updating, make sure the parameters in the `SamplingParams` object in [llm_server/llm/vllm/vllm_backend.py](https://git.evulid.cc/cyberes/local-llm-server/src/branch/master/llm_server/llm/vllm/vllm_backend.py) match up with those in VLLM's [vllm/sampling_params.py](https://github.com/vllm-project/vllm/blob/93348d9458af7517bb8c114611d438a1b4a2c3be/vllm/sampling_params.py).

			`Additionally, make sure our VLLM API server at [other/vllm/vllm_api_server.py](https://git.evulid.cc/cyberes/local-llm-server/src/branch/master/other/vllm/vllm_api_server.py) matches [vllm/entrypoints/api_server.py](https://github.com/vllm-project/vllm/blob/93348d9458af7517bb8c114611d438a1b4a2c3be/vllm/entrypoints/api_server.py).`

			Then, update the VLLM version in `requirements.txt`.



			`## To Do`

			`- [ ] Support the Oobabooga Text Generation WebUI as a backend`
			`- [ ] Make the moderation apply to the non-OpenAI endpoints as well`
minor changes, add admin token auth system, add route to get backend info 2023-09-24 15:54:35 -06:00			`- [ ] Make sure stats work when starting from an empty database`
ready for public release 2024-03-18 12:42:44 -06:00			`- [ ] Make sure we're correctly canceling requests when the client cancels. The blocking endpoints can't detect when a client cancels generation.`
			`- [ ] Add test to verify the OpenAI endpoint works as expected`
archive project 2024-10-27 12:13:26 -06:00			- [ ] Document the `Llm-Disable-Openai` header