An HTTP API to serve local LLMs.

This repository has been archived on 2024-10-27. You can view files and clone it, but cannot push or open issues or pull requests.

Cyberes cb99c3490e rewrite tokenizer, restructure validation		2023-09-24 13:02:30 -06:00
config	add moderation endpoint to openai api, update config	2023-09-14 15:07:17 -06:00
llm_server	rewrite tokenizer, restructure validation	2023-09-24 13:02:30 -06:00
other	rewrite tokenizer, restructure validation	2023-09-24 13:02:30 -06:00
templates	fix division by 0, prettify /stats json, add js var to home	2023-09-16 17:37:43 -06:00
.gitignore	actually we don't want to emulate openai	2023-09-12 01:04:11 -06:00
LICENSE	Initial commit	2023-08-21 14:40:46 -06:00
README.md	adjust some things	2023-09-12 01:10:58 -06:00
VLLM INSTALL.md	adjust logging, add more vllm stuff	2023-09-13 11:22:33 -06:00
requirements.txt	port to mysql, use vllm tokenizer endpoint	2023-09-20 20:30:31 -06:00
server.py	rewrite tokenizer, restructure validation	2023-09-24 13:02:30 -06:00

README.md

local-llm-server

The purpose of this server is to abstract your LLM backend from your frontend API. This enables you to make changes to (or even switch) your backend without affecting your clients.
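
The idea of decoupling clients from the backend can be sketched roughly as below. This is only an illustration of the pattern, not code from this repository: the `Backend` protocol, both stand-in backend classes, `handle_request`, and the response shape are all hypothetical names chosen for the example.

```python
from typing import Protocol


class Backend(Protocol):
    """Hypothetical interface; not taken from this repository."""

    def generate(self, prompt: str) -> str: ...


class EchoBackend:
    """Stand-in backend that just echoes the prompt."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseBackend:
    """A second stand-in, to show that backends can be swapped."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


def handle_request(backend: Backend, prompt: str) -> dict:
    # The client-facing response shape (assumed here) stays the same
    # no matter which backend produced the text.
    return {"results": [{"text": backend.generate(prompt)}]}


print(handle_request(EchoBackend(), "hi"))
print(handle_request(UppercaseBackend(), "hi"))
```

Swapping `EchoBackend` for `UppercaseBackend` changes the generated text but not the structure the client receives, which is the property the paragraph above describes.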

Install

sudo apt install redis
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
wget https://git.evulid.cc/attachments/89c87201-58b1-4e28-b8fd-d0b323c810c4 -O /tmp/vllm_gptq-0.1.3-py3-none-any.whl && pip install /tmp/vllm_gptq-0.1.3-py3-none-any.whl && rm /tmp/vllm_gptq-0.1.3-py3-none-any.whl
python3 server.py

An example systemd service file is provided in other/local-llm.service.

Configure

First, set up your LLM backend. Currently, only oobabooga/text-generation-webui is supported, but eventually huggingface/text-generation-inference will be the default.

Then, configure this server. A sample config file is provided at config/config.yml.sample; copy it to config/config.yml and edit the copy.

Set backend_url to the base API URL of your backend.
Set token_limit to the configured token limit of the backend. This number is shown to clients and on the home page.
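
As a quick sanity check on these two settings, something like the following could be run against the parsed config. It is a sketch under assumptions: `validate_config` is a hypothetical helper, not part of this server, and the URL and limit in `example` are placeholder values. It takes an already-loaded dict, so loading the YAML file itself is left out.

```python
from urllib.parse import urlparse


def validate_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means both keys look sane."""
    problems = []
    url = urlparse(str(cfg.get("backend_url", "")))
    if url.scheme not in ("http", "https") or not url.netloc:
        problems.append("backend_url must be an http(s) URL")
    limit = cfg.get("token_limit")
    # bool is a subclass of int, so reject it explicitly.
    if not isinstance(limit, int) or isinstance(limit, bool) or limit <= 0:
        problems.append("token_limit must be a positive integer")
    return problems


# Placeholder values for illustration only.
example = {"backend_url": "http://127.0.0.1:5000", "token_limit": 2048}
print(validate_config(example))
```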

To set up token auth, add rows to the token_auth table in the SQLite database.

token: the token/password.

type: the type of token. Currently unused (maybe for a future web interface?) but required.

priority: the lower this value, the higher the priority. Requests from higher-priority tokens are moved ahead in the queue.

uses: how many responses this token has generated. Leave empty.

max_uses: how many responses this token is allowed to generate. Leave empty for no limit.

expire: UNIX timestamp of when this token expires and is no longer valid.

disabled: mark the token as disabled.
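
A minimal sketch of adding rows and checking a token, using Python's built-in sqlite3 module. The column names come from the list above, but the column types, the `add_token` and `is_token_valid` helpers, and the validity rules are assumptions for illustration. The server's real schema and checks may differ. An in-memory database is used so the example does not touch real data.

```python
import sqlite3
import time

# Assumed schema: column names are from the README, types are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS token_auth (
    token    TEXT PRIMARY KEY,
    type     TEXT NOT NULL,
    priority INTEGER NOT NULL,
    uses     INTEGER,
    max_uses INTEGER,
    expire   INTEGER,
    disabled INTEGER NOT NULL DEFAULT 0
)
"""


def add_token(conn, token, type_, priority, max_uses=None, expire=None):
    # "uses" is left empty (NULL), as the README instructs.
    conn.execute(
        "INSERT INTO token_auth (token, type, priority, max_uses, expire) "
        "VALUES (?, ?, ?, ?, ?)",
        (token, type_, priority, max_uses, expire),
    )
    conn.commit()


def is_token_valid(conn, token, now=None):
    """Hypothetical check that mirrors the column descriptions."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT uses, max_uses, expire, disabled FROM token_auth WHERE token = ?",
        (token,),
    ).fetchone()
    if row is None:
        return False
    uses, max_uses, expire, disabled = row
    if disabled:
        return False
    if expire is not None and now >= expire:
        return False
    if max_uses is not None and (uses or 0) >= max_uses:
        return False
    return True


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration
conn.execute(SCHEMA)
add_token(conn, "example-token", "api", priority=1, expire=int(time.time()) + 3600)
add_token(conn, "low-priority-token", "api", priority=10, max_uses=100)

# A lower priority value sorts first, i.e. is served first.
order = [r[0] for r in conn.execute("SELECT token FROM token_auth ORDER BY priority")]
print(order)
print(is_token_valid(conn, "example-token"))
```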

Use

DO NOT lose your database. It is used to calculate the estimated wait time from the average TPS and response tokens. If you lose those stats, the estimates will be inaccurate until the database fills back up. If you change GPUs, you should probably clear the generation_time column in the prompts table.
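
To make the dependence on stored stats concrete, here is a rough sketch of how a wait estimate could be derived from past generations, and how the generation_time column could be cleared. The formula (queued requests times average response tokens, divided by average TPS), the `response_tokens` column, both helper functions, and the sample rows are assumptions. Only the prompts table and the generation_time column are named in this README.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration
# Assumed minimal schema; the real prompts table has its own columns.
conn.execute("CREATE TABLE prompts (response_tokens INTEGER, generation_time REAL)")
conn.executemany(
    "INSERT INTO prompts VALUES (?, ?)",
    [(100, 5.0), (200, 10.0), (300, 15.0)],  # made-up sample rows
)


def estimate_wait_seconds(conn, queued_requests):
    """Assumed formula: queued requests * avg response tokens / avg TPS."""
    total_tokens, total_time, avg_tokens = conn.execute(
        "SELECT SUM(response_tokens), SUM(generation_time), AVG(response_tokens) "
        "FROM prompts WHERE generation_time IS NOT NULL AND generation_time > 0"
    ).fetchone()
    if not total_time or not total_tokens:
        return None  # no usable history yet, so no estimate
    avg_tps = total_tokens / total_time
    return queued_requests * avg_tokens / avg_tps


def clear_generation_times(conn):
    # One way to "clear" the column, e.g. after changing GPUs: set it to NULL.
    conn.execute("UPDATE prompts SET generation_time = NULL")
    conn.commit()


print(estimate_wait_seconds(conn, queued_requests=3))
```

With history cleared (or lost), the sketch has nothing to average over and returns `None`, which is the situation the warning above is about.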

To Do

Implement streaming
Add huggingface/text-generation-inference
Convince Oobabooga to implement concurrent generation
Make sure stats work when starting from an empty database
Make sure we're correctly canceling requests when the client cancels
Implement auth and tokens on the websocket endpoint. Maybe add something to the instruct prompt and then remove it before proxying?