# local-llm-server

_An HTTP API to serve local LLMs._

The purpose of this server is to abstract your LLM backend from your frontend API. This enables you to make changes to (or even switch) your backend without affecting your clients.

### Install

1. `sudo apt install redis`
2. `python3 -m venv venv`
3. `source venv/bin/activate`
4. `pip install -r requirements.txt`
5. `wget https://git.evulid.cc/attachments/89c87201-58b1-4e28-b8fd-d0b323c810c4 -O /tmp/vllm_gptq-0.1.3-py3-none-any.whl && pip install /tmp/vllm_gptq-0.1.3-py3-none-any.whl && rm /tmp/vllm_gptq-0.1.3-py3-none-any.whl`
6. `python3 server.py`

An example systemd service file is provided in `other/local-llm.service`.

### Configure

First, set up your LLM backend. Currently, only [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) is supported, but eventually [huggingface/text-generation-inference](https://github.com/huggingface/text-generation-inference) will be the default.

Then, configure this server. A sample config file is provided at `config/config.yml.sample`; copy it to `config/config.yml` and edit the copy.

1. Set `backend_url` to the base API URL of your backend.
2. Set `token_limit` to the configured token limit of the backend. This number is shown to clients and on the home page.

To set up token auth, add rows to the `token_auth` table in the SQLite database.

`token`: the token/password.

`type`: the type of token. Currently unused (maybe for a future web interface?) but required.

`priority`: the lower this value, the higher the priority. Higher-priority tokens are moved ahead in the queue.

`uses`: how many responses this token has generated. Leave empty.

`max_uses`: how many responses this token is allowed to generate. Leave empty for unlimited uses.

`expire`: UNIX timestamp of when this token expires and is no longer valid.

`disabled`: mark the token as disabled.
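As a rough illustration of adding rows, here is a minimal Python sketch using the standard `sqlite3` module. Only the table name and column names come from the list above. The column types, the `0`/`1` encoding of `disabled`, the `"api"` value for `type`, the token strings, the `add_token` helper, and the in-memory database are all assumptions for the sake of a runnable example; check them against the real database before relying on this.

```python
import sqlite3
import time

# Assumed schema: column names follow the README, but the types and the
# 0/1 encoding of `disabled` are guesses, not the server's actual schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS token_auth (
    token    TEXT PRIMARY KEY,
    type     TEXT NOT NULL,
    priority INTEGER NOT NULL,
    uses     INTEGER,
    max_uses INTEGER,
    expire   INTEGER,
    disabled INTEGER NOT NULL DEFAULT 0
)
"""


def add_token(conn, token, type_, priority, expire, max_uses=None):
    """Insert one token row (hypothetical helper).

    `uses` is left empty, and `max_uses=None` leaves the token unrestricted.
    """
    conn.execute(
        "INSERT INTO token_auth "
        "(token, type, priority, uses, max_uses, expire, disabled) "
        "VALUES (?, ?, ?, NULL, ?, ?, 0)",
        (token, type_, priority, max_uses, expire),
    )
    conn.commit()


# In-memory database for the demo; point this at your real database file.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

in_30_days = int(time.time()) + 30 * 24 * 3600  # UNIX timestamp
add_token(conn, "example-token-a", "api", priority=10, expire=in_30_days)
add_token(conn, "example-token-b", "api", priority=1, expire=in_30_days, max_uses=100)

# A lower priority value means a higher priority, so ascending order
# lists the highest-priority token first.
rows = conn.execute(
    "SELECT token, priority FROM token_auth ORDER BY priority"
).fetchall()
print(rows)
```

Using parameterized queries (`?` placeholders) keeps token strings from being interpreted as SQL.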

### Use

**DO NOT** lose your database. It's used for calculating the estimated wait time based on average TPS (tokens per second) and response tokens. If you lose those stats, your numbers will be inaccurate until the database fills back up again. If you change GPUs, you should probably clear the `generation_time` column in the `prompts` table.
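To make the role of these stats concrete, here is a small Python sketch of one way such an estimate could be computed, plus a helper that clears `generation_time`. It is not the server's actual calculation: the formula (requests ahead in the queue times average response tokens divided by average TPS), the `response_tokens` column, the function names, and the in-memory table are all assumptions. Only the `prompts` table and its `generation_time` column are named in this README.

```python
import sqlite3


def estimate_wait_seconds(avg_response_tokens, avg_tps, queued_requests):
    """Rough wait estimate: average time per response times requests ahead."""
    if avg_tps <= 0:
        return 0.0  # no usable stats yet, e.g. an empty database
    return queued_requests * (avg_response_tokens / avg_tps)


def clear_generation_times(conn):
    """Blank the generation_time column, e.g. after changing GPUs."""
    conn.execute("UPDATE prompts SET generation_time = NULL")
    conn.commit()


# Demo table: `response_tokens` is a hypothetical column for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prompts (response_tokens INTEGER, generation_time REAL)")
conn.executemany("INSERT INTO prompts VALUES (?, ?)", [(200, 10.0), (400, 20.0)])

avg_tokens, total_tokens, total_time = conn.execute(
    "SELECT AVG(response_tokens), SUM(response_tokens), SUM(generation_time) "
    "FROM prompts"
).fetchone()
avg_tps = total_tokens / total_time  # 600 tokens over 30 s = 20 tokens/s
wait = estimate_wait_seconds(avg_tokens, avg_tps, queued_requests=3)
print(wait)  # 3 * (300 / 20) = 45.0

clear_generation_times(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM prompts WHERE generation_time IS NOT NULL"
).fetchone()[0]
print(remaining)  # 0
```

The sketch also shows why losing the database matters: with no rows, there is nothing to average, so any estimate is a placeholder until new generations are logged.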

### To Do

- Implement streaming
- Add `huggingface/text-generation-inference`
- Convince Oobabooga to implement concurrent generation
- Make sure stats work when starting from an empty database
- Make sure we're correctly canceling requests when the client cancels
- Implement auth and tokens on the websocket endpoint. Maybe add something to the instruct prompt and then remove it before proxying??