# Text Generation Inference
<div align="center">

![architecture](assets/architecture.jpg)

</div>

A Rust and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co)
to power the BLOOM, BLOOMZ, and MT0-XXL api-inference widgets.

## Features
- [Dynamic batching of incoming requests](https://github.com/huggingface/text-generation-inference/blob/main/router/src/batcher.rs#L88) for increased total throughput
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- 45ms per token generation for BLOOM with 8xA100 80GB
- Logits warpers (temperature scaling, top-k, ...)
- Stop sequences
- Log probabilities

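
To illustrate what the logits warpers in the list above do, here is a minimal pure-Python sketch of temperature scaling followed by top-k filtering. It is not the server's implementation; the function names (`apply_temperature`, `apply_top_k`, `sample_token`) are hypothetical and chosen only for this example.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Divide every logit by the temperature (< 1 sharpens, > 1 flattens)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [x / temperature for x in logits]


def apply_top_k(logits, k):
    """Keep the k largest logits and mask the rest with -inf.

    Ties at the cut-off value are all kept, so slightly more than k
    entries can survive when logits are equal.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits):
    """Convert logits to probabilities; masked (-inf) entries get 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Warp the logits, then sample one token id from the result."""
    warped = apply_temperature(logits, temperature)
    if top_k is not None:
        warped = apply_top_k(warped, top_k)
    probs = softmax(warped)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With `top_k=1` the sketch degenerates to greedy decoding, since only the highest-scoring token keeps a non-zero probability.
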
## Officially supported models
- [BLOOM](https://huggingface.co/bigscience/bloom)
- [BLOOMZ](https://huggingface.co/bigscience/bloomz)
- [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)
- ~~[Galactica](https://huggingface.co/facebook/galactica-120b)~~ (deactivated)
- [SantaCoder](https://huggingface.co/bigcode/santacoder)
- [GPT-Neox 20B](https://huggingface.co/EleutherAI/gpt-neox-20b): use `--revision pr/13`

Other models are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`

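
As a rough sketch of how the two calls above could be combined, the snippet below tries the causal LM class first and falls back to the seq2seq class. The try-then-fall-back order and the `load_best_effort` helper are assumptions made for illustration, not a description of the server's loading code. It needs `transformers` (and `accelerate` for `device_map="auto"`) only when the default loaders are used.

```python
def load_best_effort(model_id, loaders=None):
    """Return the first model that loads; `loaders` is a hypothetical hook."""
    if loaders is None:
        # Imported lazily so the fallback logic can be exercised with
        # stand-in loaders on a machine without transformers installed.
        from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM

        loaders = [
            lambda name: AutoModelForCausalLM.from_pretrained(name, device_map="auto"),
            lambda name: AutoModelForSeq2SeqLM.from_pretrained(name, device_map="auto"),
        ]
    if not loaders:
        raise ValueError("at least one loader is required")

    last_error = None
    for load in loaders:
        try:
            return load(model_id)
        except ValueError as err:
            # Assumption: an Auto class that does not recognise the model's
            # configuration raises ValueError, so the next loader is tried.
            last_error = err
    raise last_error
```

Calling `load_best_effort("bigscience/bloom-560m")` would download the weights on first use.
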
## Load Tests for BLOOM
See `k6/load_test.js`

|                                                                             | avg       | min          | med       | max        | p(90)     | p(95)     | RPS      |
|-----------------------------------------------------------------------------|-----------|--------------|-----------|------------|-----------|-----------|----------|
| [Original code](https://github.com/huggingface/transformers_bloom_parallel) | 8.9s      | 1s           | 9.12s     | 16.69s     | 13.7s     | 14.26s    | 5.9      |
| New batching logic                                                          | **5.44s** | **959.53ms** | **5.28s** | **13.12s** | **7.78s** | **8.92s** | **9.08** |

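
To put the table in relative terms, the short sketch below computes the percentage change between the two rows for a few columns. The numbers are copied from the table; the dictionary keys and the `relative_change` helper are names made up for this example.

```python
# Values copied from the load test table (seconds and requests per second).
baseline = {"avg_s": 8.9, "p95_s": 14.26, "rps": 5.9}
batched = {"avg_s": 5.44, "p95_s": 8.92, "rps": 9.08}


def relative_change(before, after):
    """Fractional change from `before` to `after` (negative means lower)."""
    return (after - before) / before


for key in baseline:
    change = relative_change(baseline[key], batched[key])
    print(f"{key}: {change * 100:+.1f}%")
```

This works out to roughly 39% lower average latency, 37% lower p(95) latency, and 54% more requests per second for the new batching logic.
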
## Install
```shell
make install
```

## Run

### BLOOM 560m

```shell
make run-bloom-560m
```

### BLOOM
First, download the weights:

```shell
make download-bloom
```

Then start the server:

```shell
make run-bloom # Requires 8xA100 80GB
```

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:
```shell
make run-bloom-quantize # Requires 8xA100 40GB
```

## Test

```shell
curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
```

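
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the `curl` call above; the helper names (`build_generate_request`, `generate`) are hypothetical, and the response is returned as parsed JSON without assuming its schema, which this README does not document.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=9, base_url="http://127.0.0.1:3000"):
    """Build the POST request for /generate without sending it."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=60, **kwargs):
    """Send the request to a running server and return the decoded JSON body."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server listening on port 3000, `generate("Testing API")` performs the same call as the `curl` example.
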
## Develop
```shell
make server-dev
make router-dev
```