Large Language Model Text Generation Inference
Olivier Dehaene e86ecbac63 ValidationError was not correctly handled 2022-10-11 16:53:40 +02:00

Text Generation Inference

A Rust and gRPC server for text generation inference.

Load Tests

See k6/load_test.js. We send the default examples with a 1 second delay between each request.

Stages:

  • Ramp up to 50 concurrent requests per second in 1min
  • Ramp up from 50 to 100 concurrent requests per second in 2min
  • Ramp down to 0 concurrent requests per second in 1min
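The three stages above describe a piecewise ramp profile. As a rough sketch only (the real profile lives in k6/load_test.js; the `STAGES` list and the `target_at` helper are hypothetical names, and linear interpolation between stage targets is an assumption), the target load at a given time could be computed like this:

```python
# Hypothetical sketch of the ramp profile described above; not taken from k6/load_test.js.
# Each stage is (duration in seconds, target load at the end of the stage).
STAGES = [(60, 50), (120, 100), (60, 0)]


def target_at(t, stages=STAGES, start=0):
    """Return the target load at time t (seconds), assuming linear ramps."""
    prev = start      # load at the beginning of the current stage
    elapsed = 0       # seconds consumed by the stages already passed
    for duration, target in stages:
        if t <= elapsed + duration:
            frac = (t - elapsed) / duration
            return prev + (target - prev) * frac
        elapsed += duration
        prev = target
    return prev  # after the last stage the load stays at the final target


if __name__ == "__main__":
    for t in (0, 30, 60, 120, 180, 240):
        print(t, target_at(t))
```

Under these assumptions the whole test lasts 4 minutes and peaks at 100 at the 3 minute mark.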
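|                        | avg   | min      | med   | max    | p(90) | p(95)  | RPS  |
|------------------------|-------|----------|-------|--------|-------|--------|------|
| Original code          | 8.9s  | 1s       | 9.12s | 16.69s | 13.7s | 14.26s | 5.9  |
| ISO with original code | 8.88s | 959.53ms | 8.89s | 17.08s | 13.34s| 14.12s | 5.94 |
| New batching logic     | 5.44s | 1.27s    | 5.28s | 13.12s | 7.78s | 8.92s  | 9.08 |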
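To put the results in relative terms, the short sketch below compares the "Original code" and "New batching logic" rows. The figures are copied from the results above; the `relative_change` helper and the dictionary names are illustrative only. By this arithmetic, the new batching logic gives roughly 39% lower average latency and 54% higher throughput.

```python
# Figures copied from the load test results: average latency (seconds) and RPS.
original = {"avg_s": 8.9, "rps": 5.9}
new_batching = {"avg_s": 5.44, "rps": 9.08}


def relative_change(before, after):
    """Relative change from before to after, as a fraction of before."""
    return (after - before) / before


latency_change = relative_change(original["avg_s"], new_batching["avg_s"])
throughput_change = relative_change(original["rps"], new_batching["rps"])

print(f"avg latency: {latency_change:+.1%}")
print(f"throughput:  {throughput_change:+.1%}")
```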

Install

cd server
pip install .
cd ../router
cargo build --release
cd ..

Run

python server/bloom_inference/main.py bigscience/bloom --num-gpus 8 --shard-directory /dev/shm/models
./router/target/release/router

TODO:

  • Improve model download
    • Store "shardable" layers separately and layer by layer
  • Add batching args to router CLI
  • Add docstrings + comments everywhere as the codebase is fairly complicated
  • Add tests
  • Add shutdown logic in router and server
  • Improve multi-processing logic in server
  • Improve error handling everywhere
  • Improve past key layer indexing?