Large Language Model Text Generation Inference

OlivierDehaene 09674e6df9 feat(server): Support bitsandbytes		2022-10-27 14:25:29 +02:00

LLM Text Generation Inference

A Rust and gRPC server for large language model text generation inference.

Features

Quantization with bitsandbytes
Dynamic batching of incoming requests for increased total throughput
Safetensors weight loading
45 ms per-token generation latency for BLOOM on 8xA100 80GB
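
To illustrate the dynamic batching feature listed above, here is a minimal Python sketch. It is not the router's implementation (the router is written in Rust). The `Request` and `BatchingQueue` names, the `max_batch_size` parameter, and the FIFO policy are all assumptions made for illustration. The idea is that requests arriving while the model is busy wait in a queue and are then served together in one forward pass.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """A pending generation request (hypothetical shape, for illustration)."""
    request_id: int
    inputs: str
    max_new_tokens: int = 20


@dataclass
class BatchingQueue:
    """FIFO queue that hands out waiting requests in groups.

    Hypothetical helper: each call to next_batch() returns at most
    max_batch_size requests, oldest first.
    """
    max_batch_size: int = 4
    _pending: deque = field(default_factory=deque)

    def append(self, request: Request) -> None:
        """Enqueue a request that arrived while the model was busy."""
        self._pending.append(request)

    def next_batch(self) -> list:
        """Pop up to max_batch_size waiting requests, oldest first."""
        size = min(self.max_batch_size, len(self._pending))
        return [self._pending.popleft() for _ in range(size)]


queue = BatchingQueue(max_batch_size=4)
for i in range(6):
    queue.append(Request(request_id=i, inputs=f"prompt {i}"))

first = queue.next_batch()
second = queue.next_batch()
print([r.request_id for r in first])   # [0, 1, 2, 3]
print([r.request_id for r in second])  # [4, 5]
```

Six queued requests are served in two model calls instead of six, which is where the throughput gain comes from. A real batcher also has to handle requests that finish at different lengths, which this sketch ignores.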

Supported models

BLOOM
BLOOM-560m

Load Tests for BLOOM

See k6/load_test.js

Version	avg	min	med	max	p(90)	p(95)	RPS
Original code	8.9s	1s	9.12s	16.69s	13.7s	14.26s	5.9
New batching logic	5.44s	959.53ms	5.28s	13.12s	7.78s	8.92s	9.08
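
The relative gains in the table above can be worked out with a few lines of Python. The numbers are copied from the table; the dictionary keys and the `percent_change` helper are names chosen for this example only.

```python
# Values copied from the load test table (seconds and requests per second).
results = {
    "Original code": {"avg_s": 8.9, "p95_s": 14.26, "rps": 5.9},
    "New batching logic": {"avg_s": 5.44, "p95_s": 8.92, "rps": 9.08},
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


old = results["Original code"]
new = results["New batching logic"]

rps_change = percent_change(old["rps"], new["rps"])
avg_change = percent_change(old["avg_s"], new["avg_s"])
p95_change = percent_change(old["p95_s"], new["p95_s"])

print(f"RPS:         {rps_change:+.1f}%")  # +53.9%
print(f"avg latency: {avg_change:+.1f}%")  # -38.9%
print(f"p(95):       {p95_change:+.1f}%")  # -37.4%
```

In this load test the new batching logic serves roughly 54% more requests per second while cutting average and p(95) latency by a bit more than a third.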

Install

make install

Run

BLOOM-560m

make run-bloom-560m

BLOOM

First you need to download the weights:

make download-bloom

make run-bloom # Requires 8xA100 80GB

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

make run-bloom-quantize # Requires 8xA100 40GB
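
A rough back-of-the-envelope estimate shows why quantization halves the hardware requirement. The sketch below assumes that BLOOM has roughly 176 billion parameters, that unquantized weights are stored in 16 bits, and that bitsandbytes stores quantized weights in 8 bits. None of these figures are stated in this README. The estimate counts weights only and ignores activations, the key/value cache, and any layers left unquantized.

```python
# Assumptions for illustration only (not taken from this README):
BLOOM_PARAMS = 176e9          # approximate parameter count of BLOOM
BYTES_16BIT = 2               # assumed default weight precision
BYTES_8BIT = 1                # assumed precision after quantization


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory taken by the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


full = weight_memory_gb(BLOOM_PARAMS, BYTES_16BIT)
quantized = weight_memory_gb(BLOOM_PARAMS, BYTES_8BIT)

print(f"16-bit weights: {full:.0f} GB of {8 * 80} GB on 8xA100 80GB")
print(f" 8-bit weights: {quantized:.0f} GB of {8 * 40} GB on 8xA100 40GB")
```

Under these assumptions the 16-bit weights (about 352 GB) would not fit in 8x40 GB = 320 GB, while the 8-bit weights (about 176 GB) leave room for activations and the cache.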

Test

curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
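
The same request can be sent from Python with only the standard library. This is a sketch mirroring the curl call above: the `build_generate_request` and `generate` helpers are hypothetical names, and because the response format is not documented here, the body is returned as raw text.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=9,
                           base_url="http://127.0.0.1:3000"):
    """Build the POST /generate request with the same JSON body as the curl call."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_new_tokens=9, timeout=60.0):
    """Send the request and return the raw response body as text.

    Requires a running server; raises urllib.error.URLError otherwise.
    """
    request = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Inspect the request without touching the network.
request = build_generate_request("Testing API")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the server running, `print(generate("Testing API"))` sends the request and prints whatever the server returns.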

Develop

make server-dev
make router-dev

TODO:

Add tests for the server/model logic
Backport custom CUDA kernels to Transformers
Install safetensors with pip