ready for public release

This commit is contained in:
Cyberes 2024-03-18 12:42:44 -06:00
parent e21be17d9b
commit ab408c6c5b
22 changed files with 214 additions and 202 deletions

README.md

@@ -4,57 +4,110 @@ _An HTTP API to serve local LLM Models._
The purpose of this server is to abstract your LLM backend from your frontend API. This enables you to switch backends while presenting a stable API to your clients.

**Features:**

- Load balancing between a cluster of different VLLM backends.
- OpenAI-compatible API.
- Streaming support via websockets (and SSE for the OpenAI endpoint).
- Descriptive landing page.
- Logging and insights.
- Tokens and authentication with a priority system.
- Moderation system using OpenAI's moderation API.
## Install VLLM

The VLLM backend and local-llm-server don't need to be on the same machine.

1. Create a venv.
2. Open `requirements.txt`, find the line that pins VLLM (it looks something like `vllm==x.x.x`), and copy it.
3. Install that version of VLLM: `pip install vllm==x.x.x`
4. Clone the repo: `git clone https://git.evulid.cc/cyberes/local-llm-server.git`
5. Download your model.
6. Create a user to run the VLLM server:
   ```shell
   sudo adduser vllm --system
   ```
   Also, make sure the user has access to the necessary files, such as the models and the venv.
7. Copy the systemd service file from `other/vllm/vllm.service` to `/etc/systemd/system/`, edit the paths to point to your install location, and then activate the service (a sketch follows this list).
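
For step 7, activation could look like this minimal sketch (assuming the unit was installed as `/etc/systemd/system/vllm.service`):

```shell
# Pick up the new unit, start it now, and start it at boot.
sudo systemctl daemon-reload
sudo systemctl enable --now vllm.service

# Watch the logs while the model loads.
sudo journalctl -u vllm.service -f
```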
## Install

1. Create a user to run the server:
   ```shell
   sudo adduser server --system
   ```
2. `mkdir /srv/server`
3. `git clone https://git.evulid.cc/cyberes/local-llm-server.git /srv/server/local-llm-server`
4. `sudo apt install redis`
5. `python3 -m venv venv`
6. `./venv/bin/pip install -r requirements.txt`
7. `chown -R server:nogroup /srv/server`
8. Create the logs location:
   ```shell
   sudo mkdir /var/log/localllm
   sudo chown -R server:adm /var/log/localllm/
   ```
9. Install nginx:
   ```shell
   sudo apt install nginx
   ```
10. An example nginx site is provided at `other/nginx-site.conf`. Copy it to `/etc/nginx/default`.
11. Copy the example config from `config/config.yml.sample` to `config/config.yml` and modify it (it's well commented).
12. Set up your MySQL server with a database and user matching what you configured in `config.yml`.
13. Install the two systemd services in `other/` and activate them (see the sketch after this list).
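
A sketch of step 13, assuming the two unit files are named `local-llm.service` and `local-llm-daemon.service` (check `other/` for the actual file names):

```shell
# Install the units, then start them now and at boot.
sudo cp /srv/server/local-llm-server/other/local-llm.service /etc/systemd/system/
sudo cp /srv/server/local-llm-server/other/local-llm-daemon.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now local-llm-daemon.service local-llm.service
```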
## Creating Tokens

You'll have to execute SQL queries to add tokens; phpMyAdmin makes this easy. A hedged example follows the field list below.

**Fields:**

- `token`: The authentication token. If it starts with `SYSTEM__`, it's reserved for internal usage.
- `type`: The token type. For your reference only, not used by the system (need to confirm this, though).
- `priority`: The priority of the token. Higher-priority tokens are bumped up in the queue according to their priority.
- `simultaneous_ip`: How many requests from a single IP are allowed to be in the queue.
- `openai_moderation_enabled`: Enable moderation for this token. `1` means enabled, `0` means disabled.
- `uses`: How many times this token has been used. Set it to `0` and don't touch it.
- `max_uses`: How many times this token is allowed to be used. Set to `NULL` to disable the restriction and allow unlimited uses.
- `expire`: When the token expires and will no longer be accepted, as a Unix timestamp.
- `disabled`: Set the token to be disabled.
## Updating VLLM

This project is pinned to a specific VLLM version because it depends on VLLM's sampling parameters. When updating, make sure the parameters in the `SamplingParams` object in [llm_server/llm/vllm/vllm_backend.py](https://git.evulid.cc/cyberes/local-llm-server/src/branch/master/llm_server/llm/vllm/vllm_backend.py) match those in VLLM's [vllm/sampling_params.py](https://github.com/vllm-project/vllm/blob/93348d9458af7517bb8c114611d438a1b4a2c3be/vllm/sampling_params.py).

Additionally, make sure our VLLM API server at [other/vllm/vllm_api_server.py](https://git.evulid.cc/cyberes/local-llm-server/src/branch/master/other/vllm/vllm_api_server.py) matches [vllm/entrypoints/api_server.py](https://github.com/vllm-project/vllm/blob/93348d9458af7517bb8c114611d438a1b4a2c3be/vllm/entrypoints/api_server.py).

Then, update the VLLM version in `requirements.txt`.
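
A rough pre-update check, assuming `NEW_REF` is a placeholder for the vLLM tag you plan to pin and you run it from the repo root:

```shell
NEW_REF=v0.2.7  # placeholder: the vLLM version you are updating to

# Diff upstream's reference API server against our copy.
curl -s "https://raw.githubusercontent.com/vllm-project/vllm/${NEW_REF}/vllm/entrypoints/api_server.py" \
  | diff -u - other/vllm/vllm_api_server.py

# List upstream's SamplingParams fields to compare by hand against the
# SamplingParams(...) call in llm_server/llm/vllm/vllm_backend.py.
curl -s "https://raw.githubusercontent.com/vllm-project/vllm/${NEW_REF}/vllm/sampling_params.py" \
  | grep -A 40 'class SamplingParams'
```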
## Use

If you see unexpected errors in the console, make sure `daemon.py` is running, or else the required data will be missing from Redis. You may need to wait a few minutes for the daemon to populate the database.

Flask may give unusual errors when running `python server.py`. I think this is coming from Flask-Socket. Running with Gunicorn seems to fix the issue: `gunicorn -b :5000 --worker-class gevent server:app`

## To Do

- [x] Implement streaming
- [x] Add VLLM support
- [ ] Bring the streaming endpoint up to the level of the blocking endpoint
- [ ] Support the Oobabooga Text Generation WebUI as a backend
- [ ] Make the moderation apply to the non-OpenAI endpoints as well
- [ ] Make sure stats work when starting from an empty database
- [ ] Make sure we're correctly canceling requests when the client cancels. The blocking endpoints can't detect when a client cancels generation.
- [ ] Add a test to verify that the OpenAI endpoint works as expected

View File

@@ -1,4 +0,0 @@
```bash
wget https://git.evulid.cc/attachments/6e7bfc04-cad4-4494-a98d-1391fbb402d3 -O /tmp/vllm-0.1.3-cp311-cp311-linux_x86_64.whl && pip install /tmp/vllm-0.1.3-cp311-cp311-linux_x86_64.whl && rm /tmp/vllm-0.1.3-cp311-cp311-linux_x86_64.whl
pip install auto_gptq
```

config/config.yml.sample

@@ -1,80 +1,157 @@
## Main ##

cluster:
  - backend_url: http://1.2.3.4:7000
    concurrent_gens: 3
    mode: vllm
    # A higher priority number means that if lower-numbered priority backends fail,
    # the proxy will fall back to backends that have greater priority numbers.
    priority: 16
  - backend_url: http://4.5.6.7:9107
    concurrent_gens: 3
    mode: vllm
    priority: 10
  - backend_url: http://7.8.9.0:9208
    concurrent_gens: 3
    mode: vllm
    priority: 10

# If enabled, the "priority" of the backends will be ignored and they will be
# prioritized by their estimated parameter count instead.
# For example, a 70b model will have a higher priority than a 13b.
prioritize_by_size: true

# The token used to access various administration endpoints.
admin_token: password1234567

# How many requests a single IP is allowed to put in the queue.
# If an IP tries to put in more than this, their requests will be rejected
# until the other(s) are completed.
simultaneous_requests_per_ip: 1

# The connection details for your MySQL database.
mysql:
  host: 127.0.0.1
  username: localllm
  password: 'password1234'
  database: localllm

# Manually set the HTTP host shown to the clients.
# Comment out to auto-detect.
# http_host: https://example.com

# Where the server will write its logs to.
webserver_log_directory: /var/log/localllm

## Optional ##

# Include SYSTEM tokens in the stats calculation.
# Applies to average_generation_elapsed_sec and estimated_avg_tps.
include_system_tokens_in_stats: true

# Run a background thread to cache the homepage. The homepage has to load
# a lot of data, so it's good to keep it cached. The thread will call whatever
# the base API URL is.
background_homepage_cacher: true

# The maximum number of tokens a client is allowed to generate.
max_new_tokens: 500

# Enable/disable streaming.
enable_streaming: true

# Show the backends that the server is configured to use. Disable this to hide them on the public homepage.
show_backends: true

# Log all prompt inputs and outputs.
log_prompts: false

# Disable the verification of SSL certificates in all HTTP requests made by the server.
verify_ssl: false

# Require a valid API key for all inference requests.
auth_required: false

# Name of your proxy, shown to clients.
llm_middleware_name: proxy.example.co

# Override the name of the model shown to clients. Comment out to auto-detect.
# manual_model_name: testing123

# JS tracking code to add to the home page.
# analytics_tracking_code: |
#   var test = 123;
#   alert(test);

# HTML to add under the "Estimated Wait Time" line.
info_html: |
  If you are having issues with ratelimiting, try using streaming.

# Enable/disable the OpenAI-compatible endpoint.
enable_openi_compatible_backend: true

# Your OpenAI API key. Only used for the moderation API and fetching data.
openai_api_key: sk-123456

# Enable/disable the endpoint that shows the system prompt sent to the AI when calling the OpenAI-compatible endpoint.
expose_openai_system_prompt: true

# Should we show our model in the OpenAI API or simulate it? If false, make sure you set
# openai_api_key since the actual OpenAI models response will be cloned.
openai_expose_our_model: false

# Add the string "###" to the stop string to prevent the AI from trying to speak as other characters.
openai_force_no_hashes: true

# Enable moderating requests via OpenAI's moderation endpoint.
openai_moderation_enabled: true

# Don't wait longer than this many seconds for the moderation request
# to OpenAI to complete.
openai_moderation_timeout: 5

# Send the last N messages in an OpenAI request to the moderation endpoint.
openai_moderation_scan_last_n: 5

# The organization name to tell the LLM on the OpenAI endpoint so it can better simulate OpenAI's response.
openai_org_name: OpenAI

# Silently trim prompts sent to the OpenAI endpoint to fit the model's context length.
openai_silent_trim: true

# Set the system prompt for the OpenAI-compatible endpoint. Comment out to use the default.
# openai_system_prompt: |
#   You are an assistant chatbot. Your main function is to provide accurate and helpful responses to the user's queries. You should always be polite, respectful, and patient. You should not provide any personal opinions or advice unless specifically asked by the user. You should not make any assumptions about the user's knowledge or abilities. You should always strive to provide clear and concise answers. If you do not understand a user's query, ask for clarification. If you cannot provide an answer, apologize and suggest the user seek help elsewhere.\nLines that start with "### ASSISTANT" were messages you sent previously.\nLines that start with "### USER" were messages sent by the user you are chatting with.\nYou will respond to the "### RESPONSE:" prompt as the assistant and follow the instructions given by the user.\n\n

## Tuneables ##

# Path that is shown to users for them to connect to.
# TODO: set this based on mode. Instead, have this be the path to the API.
frontend_api_client: /api

# How to calculate the average generation time.
# Valid options: database, minute
# "database" calculates the average from historical data in the database, with more recent data weighted more heavily.
# "minute" calculates it from the last minute of data.
average_generation_time_mode: database

## STATS ##
# These options control what is shown on the stats endpoint.

# Display the total_proompts item on the stats screen.
show_num_prompts: true

# Display the uptime item on the stats screen.
show_uptime: true

# Display the total number of tokens generated.
show_total_output_tokens: true

# If enabled, count all prompts in the database. If disabled, only count the prompts since the server started.
load_num_prompts: true

## NETDATA ##
netdata_root: http://10.0.0.50:19999
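
Not part of the sample file: after editing your copy, a quick parse check catches YAML mistakes before the server starts. The paths assume the install layout from the README, and PyYAML is assumed to be present via `requirements.txt`:

```shell
# Fail loudly if config.yml is not valid YAML.
/srv/server/local-llm-server/venv/bin/python -c \
  "import yaml; yaml.safe_load(open('/srv/server/local-llm-server/config/config.yml')); print('config parses OK')"
```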

View File

@@ -2,7 +2,6 @@ import yaml
config_default_vars = {
'log_prompts': False,
'database_path': './proxy-server.db',
'auth_required': False,
'frontend_api_client': '',
'verify_ssl': True,
@@ -14,7 +13,6 @@ config_default_vars = {
'info_html': None,
'show_total_output_tokens': True,
'simultaneous_requests_per_ip': 3,
'show_backend_info': True,
'max_new_tokens': 500,
'manual_model_name': False,
'enable_streaming': True,
@@ -24,7 +22,7 @@ config_default_vars = {
'openai_system_prompt': """You are an assistant chatbot. Your main function is to provide accurate and helpful responses to the user's queries. You should always be polite, respectful, and patient. You should not provide any personal opinions or advice unless specifically asked by the user. You should not make any assumptions about the user's knowledge or abilities. You should always strive to provide clear and concise answers. If you do not understand a user's query, ask for clarification. If you cannot provide an answer, apologize and suggest the user seek help elsewhere.\nLines that start with "### ASSISTANT" were messages you sent previously.\nLines that start with "### USER" were messages sent by the user you are chatting with.\nYou will respond to the "### RESPONSE:" prompt as the assistant and follow the instructions given by the user.\n\n""",
'http_host': None,
'admin_token': None,
'openai_epose_our_model': False,
'openai_expose_our_model': False,
'openai_force_no_hashes': True,
'include_system_tokens_in_stats': True,
'openai_moderation_scan_last_n': 5,

View File

@@ -28,7 +28,6 @@ def load_config(config_path):
opts.show_total_output_tokens = config['show_total_output_tokens']
opts.netdata_root = config['netdata_root']
opts.simultaneous_requests_per_ip = config['simultaneous_requests_per_ip']
opts.show_backend_info = config['show_backend_info']
opts.max_new_tokens = config['max_new_tokens']
opts.manual_model_name = config['manual_model_name']
opts.llm_middleware_name = config['llm_middleware_name']
@@ -39,7 +38,7 @@ def load_config(config_path):
opts.openai_api_key = config['openai_api_key']
openai.api_key = opts.openai_api_key
opts.admin_token = config['admin_token']
opts.openai_expose_our_model = config['openai_epose_our_model']
opts.openai_expose_our_model = config['openai_expose_our_model']
opts.openai_force_no_hashes = config['openai_force_no_hashes']
opts.include_system_tokens_in_stats = config['include_system_tokens_in_stats']
opts.openai_moderation_scan_last_n = config['openai_moderation_scan_last_n']
@@ -59,13 +58,12 @@ def load_config(config_path):
llm_server.routes.queue.priority_queue = PriorityQueue([x['backend_url'] for x in config['cluster']])
if opts.openai_expose_our_model and not opts.openai_api_key:
print('If you set openai_epose_our_model to false, you must set your OpenAI key in openai_api_key.')
print('If you set openai_expose_our_model to false, you must set your OpenAI key in openai_api_key.')
sys.exit(1)
opts.verify_ssl = config['verify_ssl']
if not opts.verify_ssl:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
if config['http_host']:

View File

@@ -17,7 +17,6 @@ average_generation_time_mode = 'database'
show_total_output_tokens = True
netdata_root = None
simultaneous_requests_per_ip = 3
show_backend_info = True
manual_model_name = None
llm_middleware_name = ''
enable_openi_compatible_backend = True

other/local-llm-daemon.service

@@ -5,7 +5,6 @@ After=basic.target network.target
[Service]
User=server
Group=server
ExecStart=/srv/server/local-llm-server/venv/bin/python /srv/server/local-llm-server/daemon.py
Restart=always
RestartSec=2
@@ -13,3 +12,4 @@ SyslogIdentifier=local-llm-daemon
[Install]
WantedBy=multi-user.target

other/local-llm.service

@@ -6,7 +6,6 @@ Requires=local-llm-daemon.service
[Service]
User=server
Group=server
WorkingDirectory=/srv/server/local-llm-server
# Sometimes the old processes aren't terminated when the service is restarted.
@@ -21,3 +20,4 @@ SyslogIdentifier=local-llm-server
[Install]
WantedBy=multi-user.target

View File

@@ -1,38 +0,0 @@
#!/bin/bash
# Expected to be run as root in some sort of container
cd /tmp || exit
if [ ! -d /tmp/vllm-gptq ]; then
git clone https://github.com/chu-tianxiang/vllm-gptq.git
cd vllm-gptq || exit
else
cd vllm-gptq || exit
git pull
fi
if [ ! -d /root/miniconda3 ]; then
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O Miniconda3-latest-Linux-x86_64.sh
bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b
rm /tmp/Miniconda3-latest-Linux-x86_64.sh
fi
eval "$(/root/miniconda3/bin/conda shell.bash hook)"
if [ ! -d /root/miniconda3/envs/vllm-gptq ]; then
conda create --name vllm-gptq -c conda-forge python=3.11 -y
conda activate vllm-gptq
pip install ninja
conda install -y -c "nvidia/label/cuda-11.8.0" cuda==11.8
conda install -y cudatoolkit cudnn
else
conda activate vllm-gptq
fi
pip install -r requirements.txt
CUDA_HOME=/root/miniconda3/envs/vllm-gptq python setup.py bdist_wheel
echo -e "\n\n===\nOUTPUT:"
find /tmp/vllm-gptq -name '*.whl'

View File

@@ -1,70 +0,0 @@
import io
import os
import re
from typing import List
import setuptools
from torch.utils.cpp_extension import BuildExtension
ROOT_DIR = os.path.dirname(__file__)
"""
Build vllm-gptq without any CUDA
"""
def get_path(*filepath) -> str:
return os.path.join(ROOT_DIR, *filepath)
def find_version(filepath: str):
"""Extract version information from the given filepath.
Adapted from https://github.com/ray-project/ray/blob/0b190ee1160eeca9796bc091e07eaebf4c85b511/python/setup.py
"""
with open(filepath) as fp:
version_match = re.search(
r"^__version__ = ['\"]([^'\"]*)['\"]", fp.read(), re.M)
if version_match:
return version_match.group(1)
raise RuntimeError("Unable to find version string.")
def read_readme() -> str:
"""Read the README file."""
return io.open(get_path("README.md"), "r", encoding="utf-8").read()
def get_requirements() -> List[str]:
"""Get Python package dependencies from requirements.txt."""
with open(get_path("requirements.txt")) as f:
requirements = f.read().strip().split("\n")
return requirements
setuptools.setup(
name="vllm-gptq",
version=find_version(get_path("", "__init__.py")),
author="vLLM Team",
license="Apache 2.0",
description="A high-throughput and memory-efficient inference and serving engine for LLMs",
long_description=read_readme(),
long_description_content_type="text/markdown",
url="https://github.com/vllm-project/vllm",
project_urls={
"Homepage": "https://github.com/vllm-project/vllm",
"Documentation": "https://vllm.readthedocs.io/en/latest/",
},
classifiers=[
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"License :: OSI Approved :: Apache Software License",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
],
packages=setuptools.find_packages(
exclude=("assets", "benchmarks", "csrc", "docs", "examples", "tests")),
python_requires=">=3.8",
install_requires=get_requirements(),
cmdclass={"build_ext": BuildExtension},
)

other/vllm/vllm.service

@@ -4,12 +4,11 @@ Wants=basic.target
After=basic.target network.target
[Service]
User=USERNAME
Group=USERNAME
# Can add --disable-log-requests when I know the backend won't crash
ExecStart=/storage/vllm/venv/bin/python /storage/vllm/api_server.py --model /storage/oobabooga/one-click-installers/text-generation-webui/models/TheBloke_MythoMax-L2-13B-GPTQ/ --host 0.0.0.0 --port 7000 --max-num-batched-tokens 24576
User=vllm
ExecStart=/storage/vllm/vllm-venv/bin/python3.11 /storage/vllm/api_server.py --model /storage/models/awq/MythoMax-L2-13B-AWQ --quantization awq --host 0.0.0.0 --port 7000 --gpu-memory-utilization 0.95 --max-log-len 100
Restart=always
RestartSec=2
[Install]
WantedBy=multi-user.target
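
Once the service is running, a smoke test could look like the following. The `/generate` route and JSON fields are from vLLM's stock API server, so adjust them if the copied `api_server.py` differs:

```shell
# Ask the backend on port 7000 (from ExecStart above) for a short completion.
curl -s http://127.0.0.1:7000/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello, my name is", "max_tokens": 16}'
```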

requirements.txt

@@ -15,4 +15,4 @@ redis==5.0.1
ujson==5.8.0
vllm==0.2.7
gradio~=3.46.1
coloredlogs~=15.0.1
coloredlogs~=15.0.1