# Text Generation Inference

<div align="center">

![architecture](assets/architecture.jpg)

</div>

A Rust and gRPC server for text generation inference. Used in production at [Hugging Face](https://huggingface.co)
to power the BLOOM, BLOOMZ and MT0-XXL api-inference widgets.

## Features

- [Dynamic batching of incoming requests](https://github.com/huggingface/text-generation-inference/blob/main/router/src/batcher.rs#L88) for increased total throughput
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- 45 ms per-token generation latency for BLOOM on 8xA100 80GB
- Logits warpers (temperature scaling, top-k, ...)
- Stop sequences
- Log probabilities
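
To illustrate what the logits warpers in the list above do, here is a minimal pure-Python sketch of temperature scaling followed by top-k filtering. It is not the server's code: the function names `apply_temperature`, `top_k_filter` and `softmax` are hypothetical and chosen only for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Divide every logit by the temperature (must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [x / temperature for x in logits]


def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest with -inf."""
    if k <= 0:
        raise ValueError("k must be positive")
    if k >= len(logits):
        return list(logits)
    # The k-th largest value is the cut-off; ties at the cut-off are all kept.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]


def softmax(logits):
    """Turn logits into probabilities; -inf logits get probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
warped = top_k_filter(apply_temperature(logits, temperature=0.5), k=2)
probs = softmax(warped)
```

A temperature below 1 sharpens the distribution before sampling, and top-k then removes every token outside the k most likely ones, so only the first two entries of `probs` are non-zero here.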

## Officially supported models

- [BLOOM](https://huggingface.co/bigscience/bloom)
- [BLOOMZ](https://huggingface.co/bigscience/bloomz)
- [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)
- ~~[Galactica](https://huggingface.co/facebook/galactica-120b)~~ (deactivated)
- [SantaCoder](https://huggingface.co/bigcode/santacoder)
- [GPT-NeoX 20B](https://huggingface.co/EleutherAI/gpt-neox-20b): use `--revision pr/13`

Other models are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`
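
The two calls above can be combined into a small fallback loader. The sketch below is an assumption about how one might do this, not a description of the server: this README does not say how the choice between the two classes is made, and `load_best_effort` and its `loaders` parameter are hypothetical names introduced for the example.

```python
def load_best_effort(model_id, loaders=None):
    """Try each loader in order and return the first model that loads.

    `loaders` is a hypothetical hook for this sketch, so the fallback
    logic can be exercised without downloading any weights.
    """
    if loaders is None:
        # Imported lazily: needs `transformers`, and device_map="auto"
        # additionally needs `accelerate` to be installed.
        from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM

        loaders = [
            AutoModelForCausalLM.from_pretrained,
            AutoModelForSeq2SeqLM.from_pretrained,
        ]

    errors = []
    for load in loaders:
        try:
            return load(model_id, device_map="auto")
        except Exception as err:  # keep the error and try the next class
            errors.append(err)
    raise RuntimeError(f"could not load {model_id!r}: {errors}")
```

Trying the causal class first and the sequence-to-sequence class second is a design choice made for this sketch; a model that fits neither class raises a `RuntimeError` listing both failures.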

## Load Tests for BLOOM

Results from `k6/load_test.js` (RPS: requests per second):

|                                                              | avg       | min          | med       | max        | p(90)     | p(95)     | RPS      |
|--------------------------------------------------------------|-----------|--------------|-----------|------------|-----------|-----------|----------|
| [Original code](https://github.com/huggingface/transformers_bloom_parallel) | 8.9s      | 1s           | 9.12s     | 16.69s     | 13.7s     | 14.26s    | 5.9      |
| New batching logic                                           | **5.44s** | **959.53ms** | **5.28s** | **13.12s** | **7.78s** | **8.92s** | **9.08** |
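
To make the comparison in the table easier to read, the relative changes can be worked out from its figures. The snippet below only reuses numbers from the table; the helper name `percent_change` is hypothetical.

```python
# Figures copied from the table above (latencies in seconds, RPS in requests/s).
original = {"avg": 8.9, "med": 9.12, "p95": 14.26, "rps": 5.9}
batched = {"avg": 5.44, "med": 5.28, "p95": 8.92, "rps": 9.08}


def percent_change(before, after):
    """Signed change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


changes = {key: percent_change(original[key], batched[key]) for key in original}

for key, value in changes.items():
    print(f"{key}: {value:+.1f}%")
```

With these inputs the new batching logic gives roughly 39% lower average latency, 37% lower p(95) latency and about 54% more requests per second.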

## Install

```shell
make install
```

## Run 

### BLOOM 560m

```shell
make run-bloom-560m
```

### BLOOM

First download the weights, then start the server:

```shell
make download-bloom
```

```shell
make run-bloom # Requires 8xA100 80GB
```

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
make run-bloom-quantize # Requires 8xA100 40GB
```

## Test

```shell
curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
```
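
The same request can be sent from Python using only the standard library. This is a sketch under assumptions: it takes the URL and payload from the curl command above and assumes the server replies with a JSON body, whose shape is not described here. The function names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens, base_url="http://127.0.0.1:3000"):
    """Build the same POST request as the curl command, without sending it."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_new_tokens, timeout=60.0):
    """Send the request and return the decoded body.

    Assumption: the response is JSON; its fields are not interpreted here.
    """
    request = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, `generate("Testing API", 9)` issues the same call as the curl example.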

## Develop

```shell
make server-dev
make router-dev
```