# LLM Text Generation Inference
<div align="center">
![architecture](assets/architecture.jpg)
</div>

A Rust and gRPC server for large language model text generation inference.

## Load Tests for BLOOM

See `k6/load_test.js`.
We send the default examples with a 1-second delay between requests.
Stages:
- Ramp up to 50 virtual users (VUs) in 1 min
- Ramp up from 50 to 100 VUs in 2 min
- Ramp down to 0 VUs in 1 min
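
The stage list above can be read as a piecewise-linear VU profile. The sketch below is a minimal illustration in Python, not part of `k6/load_test.js`: it assumes the ramp is linear within each stage and that the test starts from 0 VUs. `STAGES` and `target_vus` are hypothetical names introduced here.

```python
from typing import List, Tuple

# (duration in seconds, target VUs) for each stage listed above.
STAGES: List[Tuple[int, int]] = [
    (60, 50),    # ramp up to 50 VUs in 1 min
    (120, 100),  # ramp up from 50 to 100 VUs in 2 min
    (60, 0),     # ramp down to 0 VUs in 1 min
]


def target_vus(t: float, stages: List[Tuple[int, int]] = STAGES) -> float:
    """Return the target VU count t seconds into the test.

    Assumption: linear ramp between stage targets, starting from 0 VUs.
    """
    start_vus = 0.0
    elapsed = 0.0
    for duration, end_vus in stages:
        if t <= elapsed + duration:
            # Interpolate between the previous target and this stage's target.
            fraction = (t - elapsed) / duration
            return start_vus + fraction * (end_vus - start_vus)
        start_vus = float(end_vus)
        elapsed += duration
    return start_vus  # past the last stage: hold the final target


if __name__ == "__main__":
    for t in (0, 30, 60, 120, 180, 210, 240):
        print(f"t={t:>3}s  target VUs={target_vus(t):.0f}")
```

Under these assumptions the test lasts 4 minutes in total and peaks at 100 VUs at the 3-minute mark.
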
| Version | avg | min | med | max | p(90) | p(95) | RPS |
|--------------------------------------------------------------|-----------|--------------|-----------|------------|-----------|-----------|----------|
| [Original code](https://github.com/huggingface/transformers_bloom_parallel) | 8.9s | 1s | 9.12s | 16.69s | 13.7s | 14.26s | 5.9 |
| ISO with original code | 8.88s | **959.53ms** | 8.89s | 17.08s | 13.34s | 14.12s | 5.94 |
| New batching logic | **5.44s** | 1.27s | **5.28s** | **13.12s** | **7.78s** | **8.92s** | **9.08** |
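
To put the table in relative terms, the following sketch computes the percentage change between the original code and the new batching logic. The numbers are copied from the table; `percent_change` and the dictionary keys are hypothetical names used only for this illustration.

```python
def percent_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# Values copied from the table above (times in seconds).
ORIGINAL = {"avg": 8.9, "p(95)": 14.26, "RPS": 5.9}
NEW_BATCHING = {"avg": 5.44, "p(95)": 8.92, "RPS": 9.08}

if __name__ == "__main__":
    for column, baseline in ORIGINAL.items():
        change = percent_change(baseline, NEW_BATCHING[column])
        print(f"{column:>6}: {change:+.1f}%")
```

For these figures, the `avg` value drops by about 38.9%, the `p(95)` value by about 37.4%, and RPS rises by about 53.9%.
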
## Install
```shell
make install
```
## Run
```shell
make run-bloom-560m
```
## Test
```shell
curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
```
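
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the `curl` call above; `build_generate_request` and `generate` are hypothetical helper names, and the response body is returned as raw text because its schema is not described here.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    max_new_tokens: int = 9,
    base_url: str = "http://127.0.0.1:3000",
) -> urllib.request.Request:
    """Build the POST request for the /generate endpoint without sending it."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, max_new_tokens: int = 9, timeout: float = 60.0) -> str:
    """Send the request to a locally running server and return the raw body."""
    request = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server started via `make run-bloom-560m`, calling `print(generate("Testing API"))` prints the raw response body.
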
## Develop
```shell
make server-dev
make router-dev
```
## TODO:
- [ ] Add tests for the `server/model` logic
- [ ] Backport custom CUDA kernels to Transformers
- [ ] Install safetensors with pip