ready for public release

This commit is contained in:
Cyberes 2024-03-18 12:42:44 -06:00
parent e21be17d9b
commit ab408c6c5b
22 changed files with 214 additions and 202 deletions

README.md

@@ -4,57 +4,110 @@ _An HTTP API to serve local LLM Models._
The purpose of this server is to abstract your LLM backend from your frontend API. This enables you to switch backends while presenting a stable API to your clients.

**Features:**

- Load balancing between a cluster of different VLLM backends.
- OpenAI-compatible API.
- Streaming support via websockets (and SSE for the OpenAI endpoint).
- Descriptive landing page.
- Logging and insights.
- Tokens and authentication with a priority system.
- Moderation system using OpenAI's moderation API.
## Install VLLM

The VLLM backend and local-llm-server don't need to be on the same machine.

1. Create a venv.
2. Open `requirements.txt`, find the line that pins VLLM (it looks something like `vllm==x.x.x`), and copy it.
3. Install that version of VLLM: `pip install vllm==x.x.x`
4. Clone the repo: `git clone https://git.evulid.cc/cyberes/local-llm-server.git`
5. Download your model.
6. Create a user to run the VLLM server:
   ```shell
   sudo adduser vllm --system
   ```
   Also, make sure the user has access to the necessary files, such as the models and the venv.
7. Copy the systemd service file from `other/vllm/vllm.service` to `/etc/systemd/system/`, edit the paths to point to your install location, and then activate the service (a sketch follows this list).
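
For step 7, activation could look like this minimal sketch (assuming the unit was installed as `/etc/systemd/system/vllm.service`):

```shell
# Pick up the new unit, start it now, and start it at boot.
sudo systemctl daemon-reload
sudo systemctl enable --now vllm.service

# Watch the logs while the model loads.
sudo journalctl -u vllm.service -f
```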
## Install

1. Create a user to run the server:
   ```shell
   sudo adduser server --system
   ```
2. `mkdir /srv/server`
3. `git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/server/local-llm-server`
4. `sudo apt install redis`
5. `python3 -m venv venv`
6. `./venv/bin/pip install -r requirements.txt`
7. `chown -R server:nogroup /srv/server`
8. Create the logs location:
   ```shell
   sudo mkdir /var/log/localllm
   sudo chown -R server:adm /var/log/localllm/
   ```
9. Install nginx:
   ```shell
   sudo apt install nginx
   ```
10. An example nginx site is provided at `other/nginx-site.conf`. Copy it to `/etc/nginx/default`.
11. Copy the example config from `config/config.yml.sample` to `config/config.yml` and modify it (it's well commented).
12. Set up your MySQL server with a database and user matching what you configured in `config.yml`.
13. Install the two systemd services in `other/` and activate them (see the sketch after this list).
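
A sketch of step 13, assuming the two unit files are named `local-llm.service` and `local-llm-daemon.service` (check `other/` for the actual file names):

```shell
# Install the units, then start them now and at boot.
sudo cp /srv/server/local-llm-server/other/local-llm.service /etc/systemd/system/
sudo cp /srv/server/local-llm-server/other/local-llm-daemon.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now local-llm-daemon.service local-llm.service
```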
## Creating Tokens

You'll have to execute SQL queries to add tokens; phpMyAdmin makes this easy. A hedged example follows the field list below.

**Fields:**

- `token`: The authentication token. If it starts with `SYSTEM__`, it's reserved for internal usage.
- `type`: The token type. For your reference only, not used by the system (need to confirm this, though).
- `priority`: The priority of the token. Higher-priority tokens are bumped up in the queue according to their priority.
- `simultaneous_ip`: How many requests from a single IP are allowed to be in the queue.
- `openai_moderation_enabled`: Enable moderation for this token. `1` means enabled, `0` means disabled.
- `uses`: How many times this token has been used. Set it to `0` and don't touch it.
- `max_uses`: How many times this token is allowed to be used. Set to `NULL` to disable the restriction and allow unlimited uses.
- `expire`: When the token expires and will no longer be accepted, as a Unix timestamp.
- `disabled`: Set the token to be disabled.
## Updating VLLM

This project is pinned to a specific VLLM version because it depends on VLLM's sampling parameters. When updating, make sure the parameters in the `SamplingParams` object in [llm_server/llm/vllm/vllm_backend.py](https://git.evulid.cc/cyberes/local-llm-server/src/branch/master/llm_server/llm/vllm/vllm_backend.py) match those in VLLM's [vllm/sampling_params.py](https://github.com/vllm-project/vllm/blob/93348d9458af7517bb8c114611d438a1b4a2c3be/vllm/sampling_params.py).

Additionally, make sure our VLLM API server at [other/vllm/vllm_api_server.py](https://git.evulid.cc/cyberes/local-llm-server/src/branch/master/other/vllm/vllm_api_server.py) matches [vllm/entrypoints/api_server.py](https://github.com/vllm-project/vllm/blob/93348d9458af7517bb8c114611d438a1b4a2c3be/vllm/entrypoints/api_server.py).

Then, update the VLLM version in `requirements.txt`.
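
A rough pre-update check, assuming `NEW_REF` is a placeholder for the vLLM tag you plan to pin and you run it from the repo root:

```shell
NEW_REF=v0.2.7  # placeholder: the vLLM version you are updating to

# Diff upstream's reference API server against our copy.
curl -s "https://raw.githubusercontent.com/vllm-project/vllm/${NEW_REF}/vllm/entrypoints/api_server.py" \
  | diff -u - other/vllm/vllm_api_server.py

# List upstream's SamplingParams fields to compare by hand against the
# SamplingParams(...) call in llm_server/llm/vllm/vllm_backend.py.
curl -s "https://raw.githubusercontent.com/vllm-project/vllm/${NEW_REF}/vllm/sampling_params.py" \
  | grep -A 40 'class SamplingParams'
```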
## Use

If you see unexpected errors in the console, make sure `daemon.py` is running, or else the required data will be missing from Redis. You may need to wait a few minutes for the daemon to populate the database.

Flask may give unusual errors when running `python server.py`. I think this is coming from Flask-Socket. Running with Gunicorn seems to fix the issue: `gunicorn -b :5000 --worker-class gevent server:app`

## To Do

- [x] Implement streaming
- [x] Add VLLM support
- [ ] Bring the streaming endpoint up to the level of the blocking endpoint
- [ ] Support the Oobabooga Text Generation WebUI as a backend
- [ ] Make the moderation apply to the non-OpenAI endpoints as well
- [ ] Make sure stats work when starting from an empty database
- [ ] Make sure we're correctly canceling requests when the client cancels. The blocking endpoints can't detect when a client cancels generation.
- [ ] Add a test to verify that the OpenAI endpoint works as expected

View File

@@ -1,4 +0,0 @@
```bash
wget https://git.evulid.cc/attachments/6e7bfc04-cad4-4494-a98d-1391fbb402d3 -O /tmp/vllm-0.1.3-cp311-cp311-linux_x86_64.whl && pip install /tmp/vllm-0.1.3-cp311-cp311-linux_x86_64.whl && rm /tmp/vllm-0.1.3-cp311-cp311-linux_x86_64.whl
pip install auto_gptq
```

config/config.yml.sample

@@ -1,80 +1,157 @@
## Main ##

cluster:
  - backend_url: http://1.2.3.4:7000
    concurrent_gens: 3
    mode: vllm
    # A higher priority number means that if lower-numbered priority backends fail,
    # the proxy will fall back to backends that have greater priority numbers.
    priority: 16
  - backend_url: http://4.5.6.7:9107
    concurrent_gens: 3
    mode: vllm
    priority: 10
  - backend_url: http://7.8.9.0:9208
    concurrent_gens: 3
    mode: vllm
    priority: 10

# If enabled, the "priority" of the backends will be ignored and they will be
# prioritized by their estimated parameter count instead.
# For example, a 70b model will have a higher priority than a 13b.
prioritize_by_size: true

# The token used to access various administration endpoints.
admin_token: password1234567

# How many requests a single IP is allowed to put in the queue.
# If an IP tries to put in more than this, their requests will be rejected
# until the other(s) are completed.
simultaneous_requests_per_ip: 1

# The connection details for your MySQL database.
mysql:
  host: 127.0.0.1
  username: localllm
  password: 'password1234'
  database: localllm

# Manually set the HTTP host shown to the clients.
# Comment out to auto-detect.
# http_host: https://example.com

# Where the server will write its logs to.
webserver_log_directory: /var/log/localllm

## Optional ##

# Include SYSTEM tokens in the stats calculation.
# Applies to average_generation_elapsed_sec and estimated_avg_tps.
include_system_tokens_in_stats: true

# Run a background thread to cache the homepage. The homepage has to load
# a lot of data, so it's good to keep it cached. The thread will call whatever
# the base API URL is.
background_homepage_cacher: true

# The maximum number of tokens a client is allowed to generate.
max_new_tokens: 500

# Enable/disable streaming.
enable_streaming: true

# Show the backends that the server is configured to use. Disable this to hide them on the public homepage.
show_backends: true

# Log all prompt inputs and outputs.
log_prompts: false

# Disable the verification of SSL certificates in all HTTP requests made by the server.
verify_ssl: false

# Require a valid API key for all inference requests.
auth_required: false

# Name of your proxy, shown to clients.
llm_middleware_name: proxy.example.co

# Override the name of the model shown to clients. Comment out to auto-detect.
# manual_model_name: testing123

# JS tracking code to add to the home page.
# analytics_tracking_code: |
#   var test = 123;
#   alert(test);

# HTML to add under the "Estimated Wait Time" line.
info_html: |
  If you are having issues with ratelimiting, try using streaming.

# Enable/disable the OpenAI-compatible endpoint.
enable_openi_compatible_backend: true

# Your OpenAI API key. Only used for the moderation API and fetching data.
openai_api_key: sk-123456

# Enable/disable the endpoint that shows the system prompt sent to the AI when calling the OpenAI-compatible endpoint.
expose_openai_system_prompt: true

# Should we show our model in the OpenAI API or simulate it? If false, make sure you set
# openai_api_key since the actual OpenAI models response will be cloned.
openai_expose_our_model: false

# Add the string "###" to the stop string to prevent the AI from trying to speak as other characters.
openai_force_no_hashes: true

# Enable moderating requests via OpenAI's moderation endpoint.
openai_moderation_enabled: true

# Don't wait longer than this many seconds for the moderation request
# to OpenAI to complete.
openai_moderation_timeout: 5

# Send the last N messages in an OpenAI request to the moderation endpoint.
openai_moderation_scan_last_n: 5

# The organization name to tell the LLM on the OpenAI endpoint so it can better simulate OpenAI's response.
openai_org_name: OpenAI

# Silently trim prompts sent to the OpenAI endpoint to fit the model's context length.
openai_silent_trim: true

# Set the system prompt for the OpenAI-compatible endpoint. Comment out to use the default.
# openai_system_prompt: |
#   You are an assistant chatbot. Your main function is to provide accurate and helpful responses to the user's queries. You should always be polite, respectful, and patient. You should not provide any personal opinions or advice unless specifically asked by the user. You should not make any assumptions about the user's knowledge or abilities. You should always strive to provide clear and concise answers. If you do not understand a user's query, ask for clarification. If you cannot provide an answer, apologize and suggest the user seek help elsewhere.\nLines that start with "### ASSISTANT" were messages you sent previously.\nLines that start with "### USER" were messages sent by the user you are chatting with.\nYou will respond to the "### RESPONSE:" prompt as the assistant and follow the instructions given by the user.\n\n

## Tuneables ##

# Path that is shown to users for them to connect to.
# TODO: set this based on mode. Instead, have this be the path to the API.
frontend_api_client: /api

# How to calculate the average generation time.
# Valid options: database, minute
# "database" calculates the average from historical data in the database, with more recent data weighted more heavily.
# "minute" calculates it from the last minute of data.
average_generation_time_mode: database

## STATS ##
# These options control what is shown on the stats endpoint.

# Display the total_proompts item on the stats screen.
show_num_prompts: true

# Display the uptime item on the stats screen.
show_uptime: true

# Display the total number of tokens generated.
show_total_output_tokens: true

# If enabled, count all prompts in the database. If disabled, only count the prompts since the server started.
load_num_prompts: true

## NETDATA ##
netdata_root: http://10.0.0.50:19999
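
Not part of the sample file: after editing your copy, a quick parse check catches YAML mistakes before the server starts. The paths assume the install layout from the README, and PyYAML is assumed to be present via `requirements.txt`:

```shell
# Fail loudly if config.yml is not valid YAML.
/srv/server/local-llm-server/venv/bin/python -c \
  "import yaml; yaml.safe_load(open('/srv/server/local-llm-server/config/config.yml')); print('config parses OK')"
```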

View File

@@ -2,7 +2,6 @@ import yaml
config_default_vars = {
'log_prompts': False,
'database_path': './proxy-server.db',
'auth_required': False,
'frontend_api_client': '',
'verify_ssl': True,
@@ -14,7 +13,6 @@ config_default_vars = {
'info_html': None,
'show_total_output_tokens': True,
'simultaneous_requests_per_ip': 3,
'show_backend_info': True,
'max_new_tokens': 500,
'manual_model_name': False,
'enable_streaming': True,
@@ -24,7 +22,7 @@ config_default_vars = {
'openai_system_prompt': """You are an assistant chatbot. Your main function is to provide accurate and helpful responses to the user's queries. You should always be polite, respectful, and patient. You should not provide any personal opinions or advice unless specifically asked by the user. You should not make any assumptions about the user's knowledge or abilities. You should always strive to provide clear and concise answers. If you do not understand a user's query, ask for clarification. If you cannot provide an answer, apologize and suggest the user seek help elsewhere.\nLines that start with "### ASSISTANT" were messages you sent previously.\nLines that start with "### USER" were messages sent by the user you are chatting with.\nYou will respond to the "### RESPONSE:" prompt as the assistant and follow the instructions given by the user.\n\n""",
'http_host': None,
'admin_token': None,
'openai_epose_our_model': False,
'openai_expose_our_model': False,
'openai_force_no_hashes': True,
'include_system_tokens_in_stats': True,
'openai_moderation_scan_last_n': 5,

View File

@@ -28,7 +28,6 @@ def load_config(config_path):
opts.show_total_output_tokens = config['show_total_output_tokens']
opts.netdata_root = config['netdata_root']
opts.simultaneous_requests_per_ip = config['simultaneous_requests_per_ip']
opts.show_backend_info = config['show_backend_info']
opts.max_new_tokens = config['max_new_tokens']
opts.manual_model_name = config['manual_model_name']
opts.llm_middleware_name = config['llm_middleware_name']
@@ -39,7 +38,7 @@ def load_config(config_path):
opts.openai_api_key = config['openai_api_key']
openai.api_key = opts.openai_api_key
opts.admin_token = config['admin_token']
opts.openai_expose_our_model = config['openai_epose_our_model']
opts.openai_expose_our_model = config['openai_expose_our_model']
opts.openai_force_no_hashes = config['openai_force_no_hashes']
opts.include_system_tokens_in_stats = config['include_system_tokens_in_stats']
opts.openai_moderation_scan_last_n = config['openai_moderation_scan_last_n']
@@ -59,13 +58,12 @@ def load_config(config_path):
llm_server.routes.queue.priority_queue = PriorityQueue([x['backend_url'] for x in config['cluster']])
if opts.openai_expose_our_model and not opts.openai_api_key:
print('If you set openai_epose_our_model to false, you must set your OpenAI key in openai_api_key.')
print('If you set openai_expose_our_model to false, you must set your OpenAI key in openai_api_key.')
sys.exit(1)
opts.verify_ssl = config['verify_ssl']
if not opts.verify_ssl:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
if config['http_host']:

View File

@@ -17,7 +17,6 @@ average_generation_time_mode = 'database'
show_total_output_tokens = True
netdata_root = None
simultaneous_requests_per_ip = 3
show_backend_info = True
manual_model_name = None
llm_middleware_name = ''
enable_openi_compatible_backend = True

other/local-llm-daemon.service

@@ -5,7 +5,6 @@ After=basic.target network.target
[Service]
User=server
Group=server
ExecStart=/srv/server/local-llm-server/venv/bin/python /srv/server/local-llm-server/daemon.py
Restart=always
RestartSec=2
@@ -13,3 +12,4 @@ SyslogIdentifier=local-llm-daemon
[Install]
WantedBy=multi-user.target

other/local-llm.service

@@ -6,7 +6,6 @@ Requires=local-llm-daemon.service
[Service]
User=server
Group=server
WorkingDirectory=/srv/server/local-llm-server
# Sometimes the old processes aren't terminated when the service is restarted.
@@ -21,3 +20,4 @@ SyslogIdentifier=local-llm-server
[Install]
WantedBy=multi-user.target

View File

@@ -1,38 +0,0 @@
#!/bin/bash
# Expected to be run as root in some sort of container
cd /tmp || exit
if [ ! -d /tmp/vllm-gptq ]; then
git clone https://github.com/chu-tianxiang/vllm-gptq.git
cd vllm-gptq || exit
else
cd vllm-gptq || exit
git pull
fi
if [ ! -d /root/miniconda3 ]; then
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O Miniconda3-latest-Linux-x86_64.sh
bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b
rm /tmp/Miniconda3-latest-Linux-x86_64.sh
fi
eval "$(/root/miniconda3/bin/conda shell.bash hook)"
if [ ! -d /root/miniconda3/envs/vllm-gptq ]; then
conda create --name vllm-gptq -c conda-forge python=3.11 -y
conda activate vllm-gptq
pip install ninja
conda install -y -c "nvidia/label/cuda-11.8.0" cuda==11.8
conda install -y cudatoolkit cudnn
else
conda activate vllm-gptq
fi
pip install -r requirements.txt
CUDA_HOME=/root/miniconda3/envs/vllm-gptq python setup.py bdist_wheel
echo -e "\n\n===\nOUTPUT:"
find /tmp/vllm-gptq -name '*.whl'

View File

@@ -1,70 +0,0 @@
import io
import os
import re
from typing import List
import setuptools
from torch.utils.cpp_extension import BuildExtension
ROOT_DIR = os.path.dirname(__file__)
"""
Build vllm-gptq without any CUDA
"""
def get_path(*filepath) -> str:
return os.path.join(ROOT_DIR, *filepath)
def find_version(filepath: str):
"""Extract version information from the given filepath.
Adapted from https://github.com/ray-project/ray/blob/0b190ee1160eeca9796bc091e07eaebf4c85b511/python/setup.py
"""
with open(filepath) as fp:
version_match = re.search(
r"^__version__ = ['\"]([^'\"]*)['\"]", fp.read(), re.M)
if version_match:
return version_match.group(1)
raise RuntimeError("Unable to find version string.")
def read_readme() -> str:
"""Read the README file."""
return io.open(get_path("README.md"), "r", encoding="utf-8").read()
def get_requirements() -> List[str]:
"""Get Python package dependencies from requirements.txt."""
with open(get_path("requirements.txt")) as f:
requirements = f.read().strip().split("\n")
return requirements
setuptools.setup(
name="vllm-gptq",
version=find_version(get_path("", "__init__.py")),
author="vLLM Team",
license="Apache 2.0",
description="A high-throughput and memory-efficient inference and serving engine for LLMs",
long_description=read_readme(),
long_description_content_type="text/markdown",
url="https://github.com/vllm-project/vllm",
project_urls={
"Homepage": "https://github.com/vllm-project/vllm",
"Documentation": "https://vllm.readthedocs.io/en/latest/",
},
classifiers=[
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"License :: OSI Approved :: Apache Software License",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
],
packages=setuptools.find_packages(
exclude=("assets", "benchmarks", "csrc", "docs", "examples", "tests")),
python_requires=">=3.8",
install_requires=get_requirements(),
cmdclass={"build_ext": BuildExtension},
)

other/vllm/vllm.service

@@ -4,12 +4,11 @@ Wants=basic.target
After=basic.target network.target
[Service]
User=USERNAME
Group=USERNAME
# Can add --disable-log-requests when I know the backend won't crash
ExecStart=/storage/vllm/venv/bin/python /storage/vllm/api_server.py --model /storage/oobabooga/one-click-installers/text-generation-webui/models/TheBloke_MythoMax-L2-13B-GPTQ/ --host 0.0.0.0 --port 7000 --max-num-batched-tokens 24576
User=vllm
ExecStart=/storage/vllm/vllm-venv/bin/python3.11 /storage/vllm/api_server.py --model /storage/models/awq/MythoMax-L2-13B-AWQ --quantization awq --host 0.0.0.0 --port 7000 --gpu-memory-utilization 0.95 --max-log-len 100
Restart=always
RestartSec=2
[Install]
WantedBy=multi-user.target
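
Once the service is running, a smoke test could look like the following. The `/generate` route and JSON fields are from vLLM's stock API server, so adjust them if the copied `api_server.py` differs:

```shell
# Ask the backend on port 7000 (from ExecStart above) for a short completion.
curl -s http://127.0.0.1:7000/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello, my name is", "max_tokens": 16}'
```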

requirements.txt

@@ -15,4 +15,4 @@ redis==5.0.1
ujson==5.8.0
vllm==0.2.7
gradio~=3.46.1
coloredlogs~=15.0.1
coloredlogs~=15.0.1