Large Language Model Text Generation Inference

Text Generation Inference

A Rust and gRPC server for text generation inference.

Load Tests

See k6/load_test.js. We send the default examples with a 1 second delay between each request.

Stages:

  • Ramp up to 50 concurrent requests per second in 1min
  • Ramp up from 50 to 100 concurrent requests per second in 2min
  • Ramp down to 0 concurrent requests per second in 1min
                         avg      min        med      max       p(90)    p(95)    RPS
Original code            8.9s     1s         9.12s    16.69s    13.7s    14.26s   5.9
ISO with original code   8.88s    959.53ms   8.89s    17.08s    13.34s   14.12s   5.94
New batching logic       5.44s    1.27s      5.28s    13.12s    7.78s    8.92s    9.08
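Reading the table, the new batching logic raises throughput from 5.9 to 9.08 requests per second, roughly a 54% improvement, while also cutting average latency. A quick sanity check of that figure:

```python
# Throughput figures taken from the load-test table above.
baseline_rps = 5.9   # original code
batched_rps = 9.08   # new batching logic

improvement = (batched_rps - baseline_rps) / baseline_rps
print(f"Throughput improvement: {improvement:.0%}")  # ~54%
```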

Install

cd server
pip install .

cd ../router
cargo build --release

Run

In one shell, start the sharded model server:

python server/bloom_inference/main.py bigscience/bloom --num-gpus 8 --shard-directory /dev/shm/models

In another shell, start the router:

./router/target/release/router
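Once both processes are up, the router can be queried over HTTP. The sketch below is a minimal Python client; note that the port (3000), the /generate endpoint path, and the JSON payload shape are assumptions for illustration and are not documented in this README — check the router source for the actual interface.

```python
import json
from urllib import request

def build_generate_request(prompt: str, max_new_tokens: int = 20) -> bytes:
    # Hypothetical payload shape: "inputs" plus a "parameters" object is a
    # common convention for text-generation APIs, assumed here.
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return json.dumps(payload).encode("utf-8")

def generate(prompt: str, url: str = "http://localhost:3000/generate") -> dict:
    # Hypothetical endpoint and port; adjust to the router's real config.
    req = request.Request(
        url,
        data=build_generate_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```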

TODO:

  • Add batching args to router CLI
  • Add docstrings + comments everywhere as the codebase is fairly complicated
  • Add tests
  • Add shutdown logic in router and server
  • Improve multi-processing logic in server
  • Improve error handling everywhere
  • Improve past key layer indexing?