# local-llm-server

_An HTTP API to serve local LLM models._

The purpose of this server is to abstract your LLM backend from your frontend API. This lets you switch backends while presenting a stable API to your frontend clients.

### Install
1. `sudo apt install redis`
2. `python3 -m venv venv`
3. `source venv/bin/activate`
4. `pip install -r requirements.txt`
5. `wget https://git.evulid.cc/attachments/89c87201-58b1-4e28-b8fd-d0b323c810c4 -O /tmp/vllm_gptq-0.1.3-py3-none-any.whl && pip install /tmp/vllm_gptq-0.1.3-py3-none-any.whl && rm /tmp/vllm_gptq-0.1.3-py3-none-any.whl`
6. `python3 server.py`

An example systemd service file is provided in `other/local-llm.service`.

### Configure
First, set up your LLM backend. Currently, only [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) is supported, but eventually [huggingface/text-generation-inference](https://github.com/huggingface/text-generation-inference) will be the default.

Then, configure this server. A sample config is provided at `config/config.yml.sample`; copy it to `config/config.yml` and edit it:
1. Set `backend_url` to the base API URL of your backend.
2. Set `token_limit` to the token limit configured on the backend. This number is shown to clients and on the home page.

To set up token auth, add rows to the `token_auth` table in the SQLite database (see the example after the field list):

- `token`: the token/password.
- `type`: the type of token. Currently unused (maybe for a future web interface?) but required.
- `priority`: the lower the value, the higher the priority. Higher-priority tokens are moved up in the queue.
- `uses`: how many responses this token has generated. Leave empty.
- `max_uses`: how many responses this token is allowed to generate. Leave empty for no limit.
- `expire`: UNIX timestamp of when this token expires and is no longer valid.
- `disabled`: mark the token as disabled.
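
For example, a row can be added with Python's built-in `sqlite3` module. The database path and the `type`/`disabled` values below are assumptions; adjust them to match your install:

```python
import sqlite3
import time

# Path to the server's SQLite database -- an assumption, point it at your actual file.
con = sqlite3.connect("database.sqlite")

con.execute(
    "INSERT INTO token_auth (token, type, priority, uses, max_uses, expire, disabled) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    (
        "example-secret-token",             # token: the token/password
        "api",                              # type: unused but required (value is an assumption)
        10,                                 # priority: lower value = higher priority
        None,                               # uses: leave empty; the server tracks this
        None,                               # max_uses: empty = unrestricted
        int(time.time()) + 30 * 24 * 3600,  # expire: UNIX timestamp (here, 30 days from now)
        0,                                  # disabled: 0 = active (assumed boolean flag)
    ),
)
con.commit()
con.close()
```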
### Use
If you see unexpected errors in the console, make sure `daemon.py` is running; otherwise the data the server needs will be missing from Redis. You may need to wait a few minutes for the daemon to populate the database.
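
To check what the daemon has written so far, you can inspect Redis directly. A quick sketch using the `redis` Python package, assuming Redis is running locally on the default port and database:

```python
import redis

# Assumes the Redis installed in the first setup step is running locally
# on the default port and database; adjust if your config differs.
r = redis.Redis(host="localhost", port=6379, db=0)

print("Redis reachable:", r.ping())

# List whatever keys the daemon has populated so far (key names depend on daemon.py).
for key in r.keys("*"):
    print(key.decode())
```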
Flask may give unusual errors when running `python3 server.py`. This seems to come from Flask-Socket. Running with Gunicorn appears to fix the issue: `gunicorn -b :5000 --worker-class gevent server:app`
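
Once the server is up, clients talk to it over plain HTTP. As a rough sketch only: the snippet below assumes the OpenAI-compatible endpoint mentioned in the To Do list is served at `/v1/completions` and that the token is sent as a bearer token; both are assumptions, so check the actual routes and auth handling in `server.py`.

```python
import requests

# Endpoint path and auth scheme are assumptions -- see server.py for the real routes
# and for how token_auth tokens are expected to be passed.
API_URL = "http://localhost:5000/v1/completions"
TOKEN = "example-secret-token"  # a token added to the token_auth table

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "prompt": "Once upon a time",
        "max_tokens": 64,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json())
```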
### To Do
- [x] Implement streaming
- [ ] Bring the streaming endpoint up to the level of the blocking endpoint
- [x] Add VLLM support
- [ ] Make sure stats work when starting from an empty database
- [ ] Make sure we're correctly canceling requests when the client cancels
- [ ] Make sure the OpenAI endpoint works as expected