
# local-llm-server

_An HTTP API to serve local LLM Models._

The purpose of this server is to abstract your LLM backend from your frontend API. This lets you switch backends while providing a stable API to your frontend clients.

### Install

1. `sudo apt install redis`
2. `python3 -m venv venv`
3. `source venv/bin/activate`
4. `pip install -r requirements.txt`
5. `wget https://git.evulid.cc/attachments/89c87201-58b1-4e28-b8fd-d0b323c810c4 -O /tmp/vllm_gptq-0.1.3-py3-none-any.whl && pip install /tmp/vllm_gptq-0.1.3-py3-none-any.whl && rm /tmp/vllm_gptq-0.1.3-py3-none-any.whl`
6. `python3 server.py`

An example systemd service file is provided in `other/local-llm.service`.

### Configure

First, set up your LLM backend. Currently, only [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) is supported, but
eventually [huggingface/text-generation-inference](https://github.com/huggingface/text-generation-inference) will be the default.

Then, configure this server. A sample config file is provided at `config/config.yml.sample`; copy it to `config/config.yml` and edit the copy.

1. Set `backend_url` to the base API URL of your backend.
2. Set `token_limit` to the configured token limit of the backend. This number is shown to clients and on the home page.
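
As a quick sanity check on these two settings, here is a minimal sketch. It assumes the YAML file has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`). The `check_config` helper and the example values are hypothetical and are not part of this server.

```python
from urllib.parse import urlparse


def check_config(cfg):
    """Return a list of problems found in the two settings described above.

    Hypothetical helper for illustration; the server does not ship it.
    """
    problems = []

    # backend_url should look like an http(s) base URL with a host.
    url = urlparse(str(cfg.get("backend_url", "")))
    if url.scheme not in ("http", "https") or not url.netloc:
        problems.append("backend_url should be an http(s) base URL")

    # token_limit should be a positive integer (bool is rejected explicitly
    # because it is a subclass of int in Python).
    limit = cfg.get("token_limit")
    if not isinstance(limit, int) or isinstance(limit, bool) or limit <= 0:
        problems.append("token_limit should be a positive integer")

    return problems


# Placeholder values; use your own backend address and token limit.
example = {"backend_url": "http://127.0.0.1:7860", "token_limit": 2048}
print(check_config(example))
```

An empty list means both settings look plausible. It does not confirm that the backend is reachable.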

To set up token auth, add rows to the `token_auth` table in the SQLite database.

`token`: the token/password.

`type`: the type of token. Currently unused (maybe for a future web interface?) but required.

`priority`: the lower this value, the higher the priority. Requests from higher-priority tokens are moved ahead in the queue.

`uses`: how many responses this token has generated. Leave empty.

`max_uses`: how many responses this token is allowed to generate. Leave empty for unlimited use.

`expire`: UNIX timestamp of when this token expires and is no longer valid.

`disabled`: mark the token as disabled.
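
Adding a token could look roughly like the sketch below. The column names come from the list above, but the column types, the `add_token` helper, the example tokens, and the in-memory database are all assumptions made for illustration. Check the actual schema and database path of your installation before inserting rows.

```python
import sqlite3
import time

# Assumed schema: only the column names are taken from this README.
# The types and defaults are guesses, used so the example is self-contained.
SCHEMA = """
CREATE TABLE IF NOT EXISTS token_auth (
    token    TEXT PRIMARY KEY,
    type     TEXT NOT NULL,
    priority INTEGER,
    uses     INTEGER,
    max_uses INTEGER,
    expire   INTEGER,
    disabled INTEGER DEFAULT 0
)
"""


def add_token(conn, token, type_, priority, max_uses=None, expire=None):
    """Insert one token row (hypothetical helper); `uses` is left empty (NULL)."""
    conn.execute(
        "INSERT INTO token_auth (token, type, priority, max_uses, expire) "
        "VALUES (?, ?, ?, ?, ?)",
        (token, type_, priority, max_uses, expire),
    )
    conn.commit()


# In-memory database for the example; point this at the server's SQLite file instead.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# An unrestricted token with a low priority (high number).
add_token(conn, "example-token-a", "api", priority=10)

# A high-priority token limited to 100 responses that expires in 30 days.
add_token(conn, "example-token-b", "api", priority=1, max_uses=100,
          expire=int(time.time()) + 30 * 24 * 3600)

# Lower `priority` values sort first, matching "lower value, higher priority".
rows = conn.execute(
    "SELECT token, priority FROM token_auth ORDER BY priority ASC"
).fetchall()
print(rows)
```

Parameterized queries (`?` placeholders) keep token strings from being interpreted as SQL.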

### Use

If you see unexpected errors in the console, make sure `daemon.py` is running or else the required data will be missing from Redis. You may need to wait a few minutes for the daemon to populate the database.

Flask may give unusual errors when running `python server.py`. I think this is coming from Flask-Socket. Running with Gunicorn seems to fix the issue: `gunicorn -b :5000 --worker-class gevent server:app`

### To Do

- [x] Implement streaming
- [ ] Bring streaming endpoint up to the level of the blocking endpoint
- [x] Add VLLM support
- [ ] Make sure stats work when starting from an empty database
- [ ] Make sure we're correctly canceling requests when the client cancels
- [ ] Make sure the OpenAI endpoint works as expected