local-llm-server/README.md at ab408c6c5bb50d06a9dffb55ebc0ffe8eb59f757

A Docker container for running VLLM on Paperspace Gradient notebooks.

Running

In Paperspace, create a new notebook.
Click Start from Scratch.
Select your GPU and set the auto-shutdown timeout to 6 hours.
Click the View Advanced Options button at the bottom of the page. Enter these details in the form that appears:
- Container Name: cyberes/vllm-paperspace:latest
- Container Command: /app/start.sh
Start the notebook. It may take up to five minutes for Paperspace to pull and start the custom image.
Once the container is started, open the log viewer by clicking the icon in the bottom left of the screen. You should see errors from rathole and VLLM as a result of the blank config files. The container will create a new directory in your mounted storage: /storage/vllm/.
Enter your rathole client config in /storage/vllm/rathole-client.toml. If you need a visual text editor, first link the directory back to the Jupyter home: ln -s /storage/vllm /notebooks
Restart rathole with supervisorctl restart rathole and then view the log: tail -f /var/log/app/rathole.log. If you see lines that start with INFO and end with Control channel established, rathole has connected and is working. Error messages will begin with ERROR.
Download an AWQ quantization from TheBloke to /storage/vllm/models/.
Enter your VLLM command-line arguments in /storage/vllm/cmd.txt. You need to set --model to the path of the model you want to load.
Restart VLLM with supervisorctl restart vllm and then view the log: tail -f /var/log/app/vllm.log. It may take up to three minutes to load. When you see the line:

INFO:     Uvicorn running on http://0.0.0.0:7000 (Press CTRL+C to quit)

VLLM is running and ready for queries.
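The two config files from the steps above can be sketched as a pair of shell helpers. This is only an illustration under assumptions: the server address, token, and service name in the rathole config are placeholders you must replace with your own values, the model folder name is hypothetical, and cmd.txt is assumed to hold all arguments on a single line. The --quantization awq flag is included on the assumption that you are loading an AWQ model as described above.

```shell
# Sketch only: server address, token, and service name are placeholders.

# Write a minimal rathole client config into the given directory.
write_rathole_config() {
  dir="$1"
  cat > "$dir/rathole-client.toml" <<'EOF'
[client]
remote_addr = "rathole.example.com:2333"  # placeholder: your rathole server
default_token = "CHANGE_ME"               # placeholder: must match the server

[client.services.vllm]                    # placeholder: name used on the server
local_addr = "127.0.0.1:7000"             # port shown in the Uvicorn log line
EOF
}

# Write the VLLM arguments into cmd.txt (single line assumed).
write_vllm_cmd() {
  dir="$1"
  model_path="$2"
  printf '%s\n' "--model $model_path --quantization awq" > "$dir/cmd.txt"
}

# Example (inside the container; the model folder name is hypothetical):
# write_rathole_config /storage/vllm
# write_vllm_cmd /storage/vllm /storage/vllm/models/your-model-AWQ
```

After writing the files, restart the matching service with supervisorctl as described in the steps above.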

In /notebooks (the home directory of Jupyter), the notebook idle.ipynb will automatically be created. Run this notebook so Paperspace does not shut down your machine due to "inactivity". You must keep the running notebook open in a browser tab.
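If you would rather not watch tail -f by hand, the log checks from the steps above can be scripted. The helpers below are a minimal sketch: the function names are made up for this example, and the patterns are simply the strings quoted earlier in this README, so adjust them if your log output differs.

```shell
# Succeeds if the rathole log reports an established control channel.
rathole_connected() {
  grep -q 'Control channel established' "$1"
}

# Poll a log file until a pattern appears or the timeout (in seconds) runs out.
wait_for_log_line() {
  log="$1"
  pattern="$2"
  timeout="${3:-180}"
  elapsed=0
  while :; do
    grep -q -- "$pattern" "$log" 2>/dev/null && return 0
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Example (inside the container):
# rathole_connected /var/log/app/rathole.log && echo "rathole is up"
# wait_for_log_line /var/log/app/vllm.log 'Uvicorn running on' 180 && echo "VLLM is ready"
```

The 180-second default matches the "up to three minutes" load time mentioned above; larger models may need a longer timeout.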

Building

You must have a GPU attached to your system when building the container (required for building VLLM).

Install the NVIDIA Container Toolkit and CUDA 11.8.
bash build-docker.sh
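Before starting a long build, it can help to confirm the prerequisites are on your PATH. The helper below is a hypothetical convenience, not part of this repository, and the list of tools to check (docker, nvidia-smi, nvcc) is an assumption about what the build needs.

```shell
# Print each of the given commands that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: only build when nothing is missing.
# missing="$(missing_tools docker nvidia-smi nvcc)"
# if [ -n "$missing" ]; then
#   echo "Missing tools: $missing" >&2
# else
#   bash build-docker.sh
# fi
```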

To run the container on your local machine:

sudo docker run -it --shm-size 14g --gpus all -v /home/user/testing123/notebooks:/notebooks -v /home/user/testing123/storage:/storage -p 8888:8888 cyberes/vllm-paperspace:latest

Create a directory on the host to mount inside the container first (for example: /home/user/testing123/). It should contain a models folder that holds the model to load, plus rathole-client.toml and cmd.txt.
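One way to prepare that directory is sketched below. It assumes the files live under storage/vllm/ on the host, mirroring the /storage/vllm/ directory the container uses on Paperspace; that layout is an assumption, so adjust the paths if your setup differs. The function name is made up for this example.

```shell
# Create the host-side folders and empty config files under a base directory.
# The storage/vllm/ layout mirrors /storage/vllm in the container (assumed).
make_test_dirs() {
  base="$1"
  mkdir -p "$base/notebooks" "$base/storage/vllm/models"
  touch "$base/storage/vllm/rathole-client.toml" "$base/storage/vllm/cmd.txt"
}

# Example:
# make_test_dirs /home/user/testing123
```

The empty rathole-client.toml and cmd.txt still need to be filled in before rathole and VLLM will start cleanly.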

If you need to debug something, you can start a shell inside the container:

sudo docker run -it --shm-size 14g --gpus all -v /home/user/testing123/notebooks:/notebooks -v /home/user/testing123/storage:/storage -p 8888:8888 --entrypoint bash cyberes/vllm-paperspace:latest
