local-llm-server

An HTTP API to serve local LLM Models.

The purpose of this server is to abstract your LLM backend from your frontend API. This lets you switch backends while providing a stable frontend for clients.

Install

  1. sudo apt install redis
  2. python3 -m venv venv
  3. source venv/bin/activate
  4. pip install -r requirements.txt
  5. wget https://git.evulid.cc/attachments/89c87201-58b1-4e28-b8fd-d0b323c810c4 -O /tmp/vllm_gptq-0.1.3-py3-none-any.whl && pip install /tmp/vllm_gptq-0.1.3-py3-none-any.whl && rm /tmp/vllm_gptq-0.1.3-py3-none-any.whl
  6. python3 server.py

An example systemd service file is provided in other/local-llm.service.

Configure

First, set up your LLM backend. Currently, only oobabooga/text-generation-webui is supported, but eventually huggingface/text-generation-inference will be the default.

Then, configure this server. A sample config file is provided at config/config.yml.sample; copy it to config/config.yml and edit the copy.

  1. Set backend_url to the base API URL of your backend.
  2. Set token_limit to the configured token limit of the backend. This number is shown to clients and on the home page.
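If you script your deployment, the two steps above can be sketched in Python. This is a minimal sketch, not part of the repository: it assumes backend_url and token_limit are top-level `key: value` lines in the sample file, and the helper name `init_config` is hypothetical.

```python
import re
import shutil
from pathlib import Path


def init_config(sample: Path, target: Path, backend_url: str, token_limit: int) -> None:
    """Copy the sample config if the target is missing, then set the two keys.

    Assumption: both keys are top-level `key: value` lines. If the real
    sample nests them, edit the file by hand instead.
    """
    if not target.exists():
        shutil.copyfile(sample, target)

    text = target.read_text()
    for key, value in (("backend_url", backend_url), ("token_limit", token_limit)):
        line = f"{key}: {value}"
        pattern = re.compile(rf"^{key}:.*$", flags=re.MULTILINE)
        if pattern.search(text):
            # Replace the first existing line for this key.
            text = pattern.sub(lambda _m, line=line: line, text, count=1)
        else:
            # Key not present: append it at the end of the file.
            text = text.rstrip("\n") + "\n" + line + "\n"
    target.write_text(text)
```

For example, `init_config(Path("config/config.yml.sample"), Path("config/config.yml"), "http://127.0.0.1:5000", 2048)`, where the URL and limit are placeholder values to replace with those of your backend.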

To set up token auth, add rows to the token_auth table in the SQLite database.

token: the token/password.

type: the type of token. Currently unused (maybe for a future web interface?) but required.

priority: the lower this value, the higher the priority. Requests from higher-priority tokens are moved ahead in the queue.

uses: how many responses this token has generated. Leave empty.

max_uses: how many responses this token is allowed to generate. Leave empty for unlimited uses.

expire: UNIX timestamp of when this token expires and is no longer valid.

disabled: mark the token as disabled.
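Adding a token row can be sketched with Python's built-in sqlite3 module. The column names come from the list above; the column types, the `create_demo_table` helper, and the example values are assumptions made for illustration, so check them against the table the server actually creates before running anything against a real database.

```python
import sqlite3
from typing import Optional


def create_demo_table(conn: sqlite3.Connection) -> None:
    """Create a stand-in token_auth table for experimenting.

    Assumption: these column types are guesses. Use this only on a scratch
    database, never on the server's real one.
    """
    conn.execute(
        """
        CREATE TABLE token_auth (
            token    TEXT PRIMARY KEY,
            type     TEXT NOT NULL,
            priority INTEGER NOT NULL,
            uses     INTEGER,
            max_uses INTEGER,
            expire   INTEGER,
            disabled INTEGER NOT NULL
        )
        """
    )


def add_token(
    conn: sqlite3.Connection,
    token: str,
    type_: str,
    priority: int,
    max_uses: Optional[int] = None,
    expire: Optional[int] = None,
) -> None:
    """Insert an enabled token. `uses` is left empty (NULL), as the README asks."""
    conn.execute(
        "INSERT INTO token_auth (token, type, priority, max_uses, expire, disabled) "
        "VALUES (?, ?, ?, ?, ?, 0)",
        (token, type_, priority, max_uses, expire),
    )
    conn.commit()
```

To try it out, open a scratch database with `sqlite3.connect(":memory:")`, call `create_demo_table(conn)`, then `add_token(conn, "example-token", "api", 100, max_uses=50)`. Passing `None` for max_uses or expire leaves that column empty.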

Use

If you see unexpected errors in the console, make sure daemon.py is running or else the required data will be missing from Redis. You may need to wait a few minutes for the daemon to populate the database.
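A quick way to tell whether Redis is reachable and has been populated is to ask it how many keys it holds. The sketch below uses only the standard library and Redis's inline `DBSIZE` command. It assumes an unauthenticated Redis on the default port 6379 and the default database; the key names written by daemon.py are not documented here, so it only reports a count.

```python
import socket
from typing import Optional


def redis_dbsize(host: str = "127.0.0.1", port: int = 6379, timeout: float = 2.0) -> Optional[int]:
    """Return the key count of the default Redis database, or None on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"DBSIZE\r\n")  # Redis accepts plain-text inline commands
            reply = sock.makefile("rb").readline()
    except OSError:
        # Connection refused, timed out, or otherwise unreachable.
        return None

    # An integer reply looks like b":42\r\n"; error replies start with b"-".
    if not reply.startswith(b":"):
        return None
    try:
        return int(reply[1:].decode().strip())
    except ValueError:
        return None
```

A result of `None` suggests Redis is not reachable (or requires authentication), while `0` suggests the daemon has not written anything yet.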

Flask may give unusual errors when running python server.py. I think this is coming from Flask-Socket. Running with Gunicorn seems to fix the issue: gunicorn -b :5000 --worker-class gevent server:app

To Do

  • Implement streaming
  • Bring streaming endpoint up to the level of the blocking endpoint
  • Add VLLM support
  • Make sure stats work when starting from an empty database
  • Make sure we're correctly canceling requests when the client cancels
  • Make sure the OpenAI endpoint works as expected