local-llm-server

An HTTP API to serve local LLM Models.

ARCHIVED PROJECT: this project was created before any good solution existed for managing LLM endpoints and has now been superseded by many good options. LiteLLM is the best replacement. If a need for an un-authenticated public connection to SillyTavern arises, check out cyberes/litellm-public.

The purpose of this server is to abstract your LLM backend from your frontend API. This lets you switch backends while presenting a stable API to your frontend clients.

Features:

  • Load balancing across a cluster of VLLM backends.
  • OpenAI-compatible API (see the example request after this list).
  • Streaming support via websockets (and SSE for the OpenAI endpoint).
  • Descriptive landing page.
  • Logging and insights.
  • Tokens and authentication with a priority system.
  • Moderation system using OpenAI's moderation API.
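
For example, a request to the OpenAI-compatible endpoint might look like the sketch below. The host, port, route prefix, and Authorization header format are placeholders following the OpenAI convention, not this project's confirmed defaults; check your deployment and config for the real values.

    # hypothetical URL and token; the route prefix depends on how the server is deployed
    curl -N http://localhost:5000/api/openai/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer example-token-123" \
        -d '{"model": "your-model", "messages": [{"role": "user", "content": "Hello!"}], "stream": true}'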

Install VLLM

The VLLM backend and local-llm-server don't need to be on the same machine.

  1. Create a venv.

  2. Open requirements.txt, find the line that pins VLLM (it looks something like vllm==x.x.x), and copy it.

  3. Install that exact version of VLLM: pip install vllm==x.x.x

  4. Clone the repo: git clone https://git.evulid.cc/cyberes/local-llm-server.git

  5. Download your model.

  6. Create a user to run the VLLM server.

    sudo adduser vllm --system
    

    Also, make sure the user has access to the necessary files like the models and the venv.

  7. Copy the systemd service file from other/vllm/vllm.service to /etc/systemd/system/ and edit the paths to point to your install location. Then enable and start the service (a condensed sketch of these steps follows below).
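
A condensed sketch of the steps above. The repo is cloned first here so requirements.txt is available to check the pinned VLLM version; /srv/vllm and the version number are illustrative, not required locations.

    # run with sudo/root where shown; /srv/vllm is an example install location
    sudo adduser vllm --system
    sudo git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/vllm/local-llm-server
    sudo python3 -m venv /srv/vllm/venv
    grep '^vllm==' /srv/vllm/local-llm-server/requirements.txt    # note the pinned version
    sudo /srv/vllm/venv/bin/pip install vllm==x.x.x               # install that exact version
    # download your model somewhere the vllm user can read, then:
    sudo chown -R vllm /srv/vllm
    sudo cp /srv/vllm/local-llm-server/other/vllm/vllm.service /etc/systemd/system/
    sudo nano /etc/systemd/system/vllm.service                    # point the paths at your install
    sudo systemctl daemon-reload
    sudo systemctl enable --now vllm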

Install

  1. Create a user to run the server:

    sudo adduser server --system
    
  2. mkdir /srv/server

  3. git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/server/local-llm-server

  4. sudo apt install redis

  5. cd /srv/server/local-llm-server && python3 -m venv venv

  6. ./venv/bin/pip install -r requirements.txt

  7. chown -R server:nogroup /srv/server

  8. Create the logs location:

    sudo mkdir /var/log/localllm
    sudo chown -R server:adm /var/log/localllm/
    
    
  9. Install nginx:

    sudo apt install nginx
    
  10. An example nginx site is provided at other/nginx-site.conf. Copy it over /etc/nginx/sites-available/default (the standard Debian layout) and reload nginx.

  11. Copy the example config from config/config.yml.sample to config/config.yml. Modify the config (it's well commented).

  12. Set up your MySQL server with a database and user according to what you configured in config.yml.

  13. Install the two systemd services in other/ and activate them (a condensed sketch of steps 10-13 follows below).
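
A rough sketch of steps 10 through 13. The database name, user, password, and systemd unit filename below are examples, not the project's actual names; match them to your config.yml and to the files in other/.

    # 10. nginx site (standard Debian layout assumed)
    sudo cp /srv/server/local-llm-server/other/nginx-site.conf /etc/nginx/sites-available/default
    sudo nginx -t && sudo systemctl reload nginx

    # 11. config
    cp /srv/server/local-llm-server/config/config.yml.sample /srv/server/local-llm-server/config/config.yml
    nano /srv/server/local-llm-server/config/config.yml

    # 12. MySQL database and user (example names and password)
    sudo mysql -e "CREATE DATABASE localllm;"
    sudo mysql -e "CREATE USER 'localllm'@'localhost' IDENTIFIED BY 'change-me';"
    sudo mysql -e "GRANT ALL PRIVILEGES ON localllm.* TO 'localllm'@'localhost'; FLUSH PRIVILEGES;"

    # 13. copy both unit files from other/ (names vary; one example shown), then activate them
    sudo cp /srv/server/local-llm-server/other/localllm.service /etc/systemd/system/
    sudo systemctl daemon-reload
    sudo systemctl enable --now localllm.service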

Creating Tokens

You'll have to execute SQL queries against the database to add tokens. phpMyAdmin makes this easy; an example INSERT is sketched after the field list below.

Fields:

  • token: The authentication token. If it starts with SYSTEM__, it's reserved for internal usage.
  • type: The token type. For your reference only; it does not appear to be used by the system (unconfirmed).
  • priority: The priority of the token. Higher priority tokens are bumped up in the queue according to their priority.
  • simultaneous_ip: How many requests from an IP are allowed to be in the queue.
  • openai_moderation_enabled: Enable moderation for this token. 1 means enabled, 0 means disabled.
  • uses: How many times this token has been used. Set it to 0 and don't touch it.
  • max_uses: How many times this token is allowed to be used. Set to NULL to disable restriction and allow infinite uses.
  • expire: When the token expires and will no longer be allowed. A Unix timestamp.
  • disabled: Whether the token is disabled.
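
A hedged example of inserting a token. The table name token_auth and the database name localllm are placeholders; check your schema (for example via phpMyAdmin) for the real names.

    # expire is a Unix timestamp (1767225600 is 2026-01-01 UTC); max_uses NULL = unlimited
    sudo mysql -D localllm -e "INSERT INTO token_auth \
        (token, type, priority, simultaneous_ip, openai_moderation_enabled, uses, max_uses, expire, disabled) \
        VALUES ('example-token-123', 'general', 100, 3, 1, 0, NULL, 1767225600, 0);"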

Updating VLLM

This project is pinned to a specific VLLM version because it depends on VLLM's sampling parameters. When updating, make sure the parameters in the SamplingParams object in llm_server/llm/vllm/vllm_backend.py match those in VLLM's vllm/sampling_params.py.

Additionally, make sure our VLLM API server at other/vllm/vllm_api_server.py matches vllm/entrypoints/api_server.py.

Then, update the VLLM version in requirements.txt.
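
One way to eyeball both checks, assuming VLLM is installed in the same venv as the server (adjust the paths otherwise):

    # locate the installed VLLM source
    VLLM_SRC=$(./venv/bin/python -c "import os, vllm; print(os.path.dirname(vllm.__file__))")

    # our API server should match VLLM's reference entrypoint
    diff other/vllm/vllm_api_server.py "$VLLM_SRC/entrypoints/api_server.py"

    # review SamplingParams for new, removed, or renamed parameters
    less "$VLLM_SRC/sampling_params.py" llm_server/llm/vllm/vllm_backend.py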

To Do

  • Support the Oobabooga Text Generation WebUI as a backend
  • Make the moderation apply to the non-OpenAI endpoints as well
  • Make sure stats work when starting from an empty database
  • Make sure we're correctly canceling requests when the client cancels. The blocking endpoints can't detect when a client cancels generation.
  • Add a test to verify the OpenAI endpoint works as expected
  • Document the Llm-Disable-Openai header