# local-llm-server
_An HTTP API to serve local LLM models._
The purpose of this server is to abstract your LLM backend from your frontend API. This lets you switch backends while presenting a stable API to your frontend clients.
### Install
1. `sudo apt install redis`
2. `python3 -m venv venv`
3. `source venv/bin/activate`
4. `pip install -r requirements.txt`
5. `wget https://git.evulid.cc/attachments/89c87201-58b1-4e28-b8fd-d0b323c810c4 -O /tmp/vllm_gptq-0.1.3-py3-none-any.whl && pip install /tmp/vllm_gptq-0.1.3-py3-none-any.whl && rm /tmp/vllm_gptq-0.1.3-py3-none-any.whl`
6. `python3 server.py`
An example systemd service file is provided at `other/local-llm.service`.
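To run the server as a service, a typical sequence might look like the following. The paths inside the unit file (working directory, venv) will likely need editing first; the service name below simply mirrors the provided file:

```bash
# Copy the example unit into systemd's directory, then enable and start it.
sudo cp other/local-llm.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now local-llm
```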
### Configure
First, set up your LLM backend. Currently, only [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) is supported, but
eventually [huggingface/text-generation-inference](https://github.com/huggingface/text-generation-inference) will be the default.
Then, configure this server. A sample config file is provided at `config/config.yml.sample`; copy it to `config/config.yml`.
1. Set `backend_url` to the base API URL of your backend.
2. Set `token_limit` to the configured token limit of the backend. This number is shown to clients and on the home page.
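As a sketch, the relevant part of `config/config.yml` might look like the following. The values are placeholders (the backend URL and token limit depend on your setup), and any other keys in the sample file should be kept as shipped:

```yaml
# Base API URL of the LLM backend (e.g. a text-generation-webui instance).
backend_url: http://localhost:5000
# Token limit configured on the backend; shown to clients and on the home page.
token_limit: 4096
```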
To set up token auth, add rows to the `token_auth` table in the SQLite database (see the sketch after this list):
- `token`: the token/password.
- `type`: the type of token. Currently unused (maybe for a future web interface?) but required.
- `priority`: the lower this value, the higher the priority. Higher-priority tokens are moved up in the queue.
- `uses`: how many responses this token has generated. Leave empty.
- `max_uses`: how many responses this token is allowed to generate. Leave empty for unrestricted.
- `expire`: UNIX timestamp of when this token expires and is no longer valid.
- `disabled`: mark the token as disabled.
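As a sketch, a row could be inserted with Python's `sqlite3` module. The database filename here is an assumption; check your config for the real location:

```python
import sqlite3
import time

# Path to the server's SQLite database -- an assumption; check your config.
db = sqlite3.connect("database.db")

db.execute(
    "INSERT INTO token_auth (token, type, priority, uses, max_uses, expire, disabled) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    (
        "example-secret-token",    # token: the token/password
        "api",                     # type: currently unused but required
        10,                        # priority: lower value = higher priority
        None,                      # uses: leave empty
        None,                      # max_uses: leave empty for unrestricted
        int(time.time()) + 86400,  # expire: UNIX timestamp (here, 24h from now)
        0,                         # disabled: 0 = enabled
    ),
)
db.commit()
db.close()
```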
### Use
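The endpoints aren't documented here yet. As a hedged sketch, assuming the server listens on port 5000 and exposes a blocking `/api/v1/generate` endpoint in the style of text-generation-webui (the port, path, auth header name, and payload shape are all assumptions), a request might look like:

```python
import requests

# Assumed base URL and endpoint -- adjust to match your deployment.
API_URL = "http://localhost:5000/api/v1/generate"

response = requests.post(
    API_URL,
    headers={"X-Api-Key": "example-secret-token"},  # header name is an assumption
    json={
        "prompt": "Write a haiku about GPUs.",
        "max_new_tokens": 128,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json())
```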
### To Do
- [x] Implement streaming
- [ ] Bring streaming endpoint up to the level of the blocking endpoint
- [x] Add vLLM support
- [ ] Make sure stats work when starting from an empty database
- [ ] Make sure we're correctly canceling requests when the client cancels
- [ ] Make sure the OpenAI endpoint works as expected