This repository has been archived on 2024-10-27. You can view files and clone it, but cannot push or open issues or pull requests.

local-llm-server

An HTTP API to serve local LLM Models.

The purpose of this server is to abstract your LLM backend from your frontend API. This lets you switch backends while providing a stable frontend API to clients.

Features:

  • Load balancing between a cluster of different VLLM backends.
  • OpenAI-compatible API.
  • Streaming support via websockets (and SSE for the OpenAI endpoint).
  • Descriptive landing page.
  • Logging and insights.
  • Tokens and authentication with a priority system.
  • Moderation system using OpenAI's moderation API.
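
As a rough illustration of the SSE streaming mentioned above, the sketch below parses a stream in the format used by OpenAI's chat completions API (`data: {...}` lines, terminated by `data: [DONE]`). It assumes this server's OpenAI endpoint emits that same format. The helper name `iter_sse_text` and the sample lines are made up for the example and are not captured from a real server.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        # SSE payload lines start with "data:"; skip blank separators and anything else.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            # OpenAI-style streams end with this sentinel.
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hypothetical sample stream, not real server output.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_text(sample)))
```

In a real client the `lines` argument would come from a streaming HTTP response rather than a list.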

Install VLLM

The VLLM backend and local-llm-server don't need to be on the same machine.

  1. Create a venv.

  2. Open requirements.txt and find the line that defines VLLM (it looks something like vllm==x.x.x) and copy it.

  3. Install that version of VLLM using pip install vllm==x.x.x

  4. Clone the repo: git clone https://git.evulid.cc/cyberes/local-llm-server.git

  5. Download your model.

  6. Create a user to run the VLLM server.

    sudo adduser vllm --system
    

    Also, make sure the user has access to the necessary files like the models and the venv.

  7. Copy the systemd service file from other/vllm/vllm.service to /etc/systemd/system/ and edit the paths to point to your install location. Then activate the service.
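
Step 2 can also be scripted. The following is a minimal sketch that pulls the pinned VLLM line out of the text of a requirements file. The function name `find_vllm_pin` and the example file contents are hypothetical, and `x.x.x` is kept as the same placeholder used above.

```python
import re
from typing import Optional


def find_vllm_pin(requirements_text: str) -> Optional[str]:
    """Return the first requirement line that pins vllm with ==, or None."""
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if re.match(r"^vllm\s*==", line, flags=re.IGNORECASE):
            return line
    return None


# Made-up file contents for illustration; "x.x.x" is a placeholder version.
example = "flask\nvllm==x.x.x  # pinned\nredis\n"
print(find_vllm_pin(example))
```

With the real file, you would pass in the contents of requirements.txt and hand the returned line to pip install.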

Install

  1. Create a user to run the server:

    sudo adduser server --system
    
  2. mkdir /srv/server

  3. git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/server/local-llm-server

  4. sudo apt install redis

  5. python3 -m venv venv

  6. ./venv/bin/pip install -r requirements.txt

  7. chown -R server:nogroup /srv/server

  8. Create the logs location:

    sudo mkdir /var/log/localllm
    sudo chown -R server:adm /var/log/localllm/
    
    
  9. Install nginx:

    sudo apt install nginx
    
  10. An example nginx site is provided at other/nginx-site.conf. Copy this to /etc/nginx/default.

  11. Copy the example config from config/config.yml.sample to config/config.yml. Modify the config (it's well commented).

  12. Set up your MySQL server with a database and user according to what you configured in config.yml.

  13. Install the two systemd services in other/ and activate them.
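
Before activating the services, it may help to confirm that the locations created in the steps above exist. This is a small sketch, not part of the project. It assumes the venv was created inside the cloned directory and that config.yml lives in the clone's config/ directory; adjust `EXPECTED_PATHS` if your layout differs.

```python
from pathlib import Path
from typing import Iterable, List

# Paths derived from the install steps; the venv and config locations are assumptions.
EXPECTED_PATHS = [
    "/srv/server/local-llm-server/config/config.yml",
    "/srv/server/local-llm-server/venv",
    "/var/log/localllm",
]


def missing_paths(paths: Iterable[str]) -> List[str]:
    """Return the subset of paths that do not exist on this machine."""
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    for path in missing_paths(EXPECTED_PATHS):
        print(f"missing: {path}")
```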

Creating Tokens

You'll have to execute SQL queries to add tokens. phpMyAdmin makes this easy.

Fields:

  • token: The authentication token. If it starts with SYSTEM__, it's reserved for internal usage.
  • type: The token type. For your reference only, not used by the system (need to confirm this, though).
  • priority: The priority of the token. Higher-priority tokens are moved ahead in the queue.
  • simultaneous_ip: How many requests from an IP are allowed to be in the queue.
  • openai_moderation_enabled: Whether moderation is enabled for this token. 1 means enabled, 0 means disabled.
  • uses: How many times this token has been used. Set it to 0 and don't touch it.
  • max_uses: How many times this token is allowed to be used. Set to NULL to disable restriction and allow infinite uses.
  • expire: When the token expires and will no longer be allowed. A Unix timestamp.
  • disabled: Whether the token is disabled.
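
The fields above can be exercised with a short sketch. To stay runnable anywhere, it uses Python's built-in sqlite3 with an in-memory database as a stand-in for the real database server. The table name `token_auth`, the column types, the example values, and the `is_usable` rule are all assumptions made for illustration; they are not taken from the project's schema or code.

```python
import secrets
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # stand-in for the real database
# Hypothetical table name and column types; only the field names come from the list above.
conn.execute(
    """
    CREATE TABLE token_auth (
        token TEXT PRIMARY KEY,
        type TEXT,
        priority INTEGER,
        simultaneous_ip INTEGER,
        openai_moderation_enabled INTEGER,
        uses INTEGER,
        max_uses INTEGER,
        expire INTEGER,
        disabled INTEGER
    )
    """
)

new_token = secrets.token_urlsafe(32)  # random, URL-safe token string
expires_at = int(time.time()) + 30 * 24 * 3600  # Unix timestamp 30 days from now
# Arbitrary example values; uses starts at 0 and max_uses is NULL (unlimited).
conn.execute(
    "INSERT INTO token_auth VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (new_token, "api", 100, 3, 1, 0, None, expires_at, 0),
)


def is_usable(token: str, now: int) -> bool:
    """Assumed rule: not disabled, not expired, and under max_uses (NULL = unlimited)."""
    row = conn.execute(
        "SELECT uses, max_uses, expire, disabled FROM token_auth WHERE token = ?",
        (token,),
    ).fetchone()
    if row is None:
        return False
    uses, max_uses, expire, disabled = row
    if disabled:
        return False
    if expire is not None and now >= expire:
        return False
    return max_uses is None or uses < max_uses


print(is_usable(new_token, int(time.time())))
```

Against the real database, the same parameterized INSERT would be run through that server's own driver or a tool such as phpMyAdmin.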

Updating VLLM

This project is pinned to a specific VLLM version because it depends on that version's sampling parameters. When updating, make sure the parameters in the SamplingParams object in llm_server/llm/vllm/vllm_backend.py match up with those in VLLM's vllm/sampling_params.py.
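
One way to spot parameter drift is to compare constructor signatures. The sketch below does this with `inspect.signature` on two toy classes. `UpstreamParams` and `LocalParams` are hypothetical stand-ins, not the real VLLM or project classes, and the approach assumes the real classes expose an ordinary `__init__` that `inspect` can read.

```python
import inspect
from typing import Callable, Dict, Set


def diff_parameters(ours: Callable, upstream: Callable) -> Dict[str, Set[str]]:
    """Compare parameter names of two callables (for classes, their __init__)."""
    our_names = set(inspect.signature(ours).parameters)
    upstream_names = set(inspect.signature(upstream).parameters)
    return {
        "missing_here": upstream_names - our_names,      # added upstream, not used locally
        "unknown_upstream": our_names - upstream_names,  # used locally, gone upstream
    }


# Toy stand-ins for illustration only.
class UpstreamParams:
    def __init__(self, temperature=1.0, top_p=1.0, top_k=-1):
        pass


class LocalParams:
    def __init__(self, temperature=1.0, top_p=1.0, max_new_tokens=16):
        pass


print(diff_parameters(LocalParams, UpstreamParams))
```

Two empty sets would mean the parameter names line up. Defaults and types would still need a manual look.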

Additionally, make sure our VLLM API server at other/vllm/vllm_api_server.py matches vllm/entrypoints/api_server.py.
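
Comparing the two API server files can be done with any diff tool. As a sketch in Python, the standard library's `difflib` produces a unified diff from two source strings. The function name `unified_diff_text` and the sample sources are invented for the example; in practice you would read both files from disk.

```python
import difflib
from typing import List


def unified_diff_text(ours: str, upstream: str) -> List[str]:
    """Return unified-diff lines describing how our copy differs from upstream."""
    return list(
        difflib.unified_diff(
            upstream.splitlines(),
            ours.splitlines(),
            fromfile="vllm/entrypoints/api_server.py",
            tofile="other/vllm/vllm_api_server.py",
            lineterm="",
        )
    )


# Invented sample sources, not the real files.
upstream_src = "import a\nrun()\n"
local_src = "import a\nrun(debug=True)\n"
for line in unified_diff_text(local_src, upstream_src):
    print(line)
```

An empty result means the two sources are identical line for line.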

Then, update the VLLM version in requirements.txt.

To Do

  • Support the Oobabooga Text Generation WebUI as a backend
  • Make the moderation apply to the non-OpenAI endpoints as well
  • Make sure stats work when starting from an empty database
  • Make sure we're correctly canceling requests when the client cancels. The blocking endpoints can't detect when a client cancels generation.
  • Add a test to verify the OpenAI endpoint works as expected