LLM Text Generation Inference

(Architecture diagram)

A Rust and gRPC server for text generation inference on large language models.

Features

Officially supported models

  • BLOOM
  • BLOOM-560m

Other models are supported on a best-effort basis using AutoModelForCausalLM.from_pretrained(<model>, torch_dtype=torch.float16, device_map="auto").
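As a minimal sketch of that fallback path (the model name below is only an example), the loading call looks like:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # example; any causal LM from the Hub can be tried on a best-effort basis

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half-precision weights
    device_map="auto",          # automatic device placement (requires accelerate)
)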

Load Tests for BLOOM

See k6/load_test.js

                     avg     min       med     max      p(90)    p(95)    RPS
Original code        8.9s    1s        9.12s   16.69s   13.7s    14.26s   5.9
New batching logic   5.44s   959.53ms  5.28s   13.12s   7.78s    8.92s    9.08
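Assuming k6 is installed locally, the script can be run with:

k6 run k6/load_test.js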

Install

make install

Run

BLOOM-560m

make run-bloom-560m

BLOOM

First you need to download the weights:

make download-bloom
make run-bloom # Requires 8xA100 80GB

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

make run-bloom-quantize # Requires 8xA100 40GB
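For reference, this is roughly what 8-bit loading with bitsandbytes looks like through transformers (the server's own quantization path may differ):

from transformers import AutoModelForCausalLM

# Sketch only: 8-bit weight loading via bitsandbytes.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True,  # requires the bitsandbytes package
)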

Test

curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
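
The same request can also be sent from Python, for example with the requests library:

import requests

# Mirrors the curl call above.
response = requests.post(
    "http://127.0.0.1:3000/generate",
    json={"inputs": "Testing API", "parameters": {"max_new_tokens": 9}},
)
print(response.json())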

Develop

make server-dev
make router-dev

TODO:

  • Add tests for the server/model logic
  • Backport custom CUDA kernels to Transformers
  • Install safetensors with pip