Large Language Model Text Generation Inference
Latest commit: `beb552127a` — feat(client): Simplify sharded logic (OlivierDehaene, 2022-10-22 23:40:05 +02:00)

| Path | Last commit | Date |
|---|---|---|
| aml | v0.1.0 | 2022-10-20 19:14:44 +02:00 |
| assets | v0.1.0 | 2022-10-20 19:14:44 +02:00 |
| k6 | Add load testing | 2022-10-11 10:36:51 +02:00 |
| launcher | feat(server): Use safetensors | 2022-10-22 20:00:15 +02:00 |
| proto | v0.1.0 | 2022-10-20 19:14:44 +02:00 |
| router | feat(client): Simplify sharded logic | 2022-10-22 23:40:05 +02:00 |
| server | feat(server): Use safetensors | 2022-10-22 20:00:15 +02:00 |
| .dockerignore | v0.1.0 | 2022-10-20 19:14:44 +02:00 |
| .gitignore | v0.1.0 | 2022-10-20 19:14:44 +02:00 |
| Cargo.lock | v0.1.0 | 2022-10-20 19:14:44 +02:00 |
| Cargo.toml | feat(server): Use safetensors | 2022-10-22 20:00:15 +02:00 |
| Dockerfile | feat(server): Use safetensors | 2022-10-22 20:00:15 +02:00 |
| LICENSE | Create LICENSE (#2) | 2022-10-22 10:44:52 +02:00 |
| Makefile | feat(server): Use safetensors | 2022-10-22 20:00:15 +02:00 |
| README.md | feat(server): Use safetensors | 2022-10-22 20:00:15 +02:00 |
| rust-toolchain.toml | v0.1.0 | 2022-10-20 19:14:44 +02:00 |


LLM Text Generation Inference

*(Architecture diagram)*

A Rust and gRPC server for text generation inference with large language models.

Load Tests for BLOOM

See `k6/load_test.js`. We send the default examples with a 1 second delay between requests.

Stages:

  • Ramp up to 50 VUs (virtual users) in 1 min
  • Ramp up from 50 to 100 VUs in 2 min
  • Ramp down to 0 VUs in 1 min
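The ramp schedule above can be sketched as data; this is a hypothetical representation for illustration, not the actual k6 configuration in `k6/load_test.js`:

```python
# Hypothetical representation of the k6 ramp stages described above.
# Each stage is (duration_seconds, target_virtual_users).
stages = [
    (60, 50),    # ramp up to 50 VUs in 1 min
    (120, 100),  # ramp from 50 to 100 VUs in 2 min
    (60, 0),     # ramp down to 0 VUs in 1 min
]

total_seconds = sum(duration for duration, _ in stages)  # whole test length
peak_vus = max(target for _, target in stages)           # maximum concurrency

print(total_seconds, peak_vus)  # 240 100
```

So a full run lasts 4 minutes and peaks at 100 concurrent virtual users.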
|                        | avg   | min      | med   | max    | p(90)  | p(95)  | RPS  |
|------------------------|-------|----------|-------|--------|--------|--------|------|
| Original code          | 8.9s  | 1s       | 9.12s | 16.69s | 13.7s  | 14.26s | 5.9  |
| ISO with original code | 8.88s | 959.53ms | 8.89s | 17.08s | 13.34s | 14.12s | 5.94 |
| New batching logic     | 5.44s | 1.27s    | 5.28s | 13.12s | 7.78s  | 8.92s  | 9.08 |
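A back-of-envelope check on the numbers above: the new batching logic raises throughput from 5.9 to 9.08 RPS while lowering average latency from 8.9s to 5.44s.

```python
# Quick sanity check on the load-test results above.
rps_old, rps_new = 5.9, 9.08
avg_old, avg_new = 8.9, 5.44

throughput_gain = (rps_new - rps_old) / rps_old  # fractional RPS improvement
latency_drop = (avg_old - avg_new) / avg_old     # fractional avg-latency reduction

print(f"{throughput_gain:.1%} more RPS, {latency_drop:.1%} lower avg latency")
# 53.9% more RPS, 38.9% lower avg latency
```

Roughly a 54% throughput gain with a 39% reduction in average latency.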

Install

```shell
make install
```

Run

```shell
make run-bloom-560m
```

Test

```shell
curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
```
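The same request can also be issued from Python. This sketch builds the request exactly as in the curl example (endpoint, payload, and parameter names are taken from the curl call; sending it requires a server started with `make run-bloom-560m`):

```python
import json
from urllib import request

# Same endpoint and payload as the curl example above.
url = "http://127.0.0.1:3000/generate"
payload = {"inputs": "Testing API", "parameters": {"max_new_tokens": 9}}

req = request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment with a running server:
# with request.urlopen(req) as resp:
#     print(resp.read().decode())
```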

Develop

```shell
make server-dev
make router-dev
```

TODO:

  • Add tests for the server/model logic
  • Backport custom CUDA kernels to Transformers
  • Install safetensors with pip