
# LLM Text Generation Inference

<div align="center">

![architecture](assets/architecture.jpg)

</div>

A Rust and gRPC server for text generation inference with large language models.

## Load Tests for BLOOM

See `k6/load_test.js`. We send the default examples with a 1-second delay between requests.

Stages:
- Ramp up to 50 virtual users (VUs) in 1 min
- Ramp up from 50 to 100 VUs in 2 min
- Ramp down to 0 VUs in 1 min
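
To make the load profile concrete, the sketch below works out the total test duration and the approximate number of virtual users at a few points in time. It assumes the ramp between stage targets is linear and that the test starts from 0 users. The names `STAGES` and `target_vus` are illustrative and do not come from `k6/load_test.js`.

```python
# Stages from the list above: (duration in seconds, target VUs at end of stage).
STAGES = [(60, 50), (120, 100), (60, 0)]


def target_vus(t, stages=STAGES, start=0):
    """Approximate VU count at time t (seconds), assuming linear ramps."""
    prev = start      # VU target at the beginning of the current stage
    elapsed = 0       # seconds consumed by the stages already passed
    for duration, target in stages:
        if t <= elapsed + duration:
            frac = (t - elapsed) / duration
            return round(prev + (target - prev) * frac)
        elapsed += duration
        prev = target
    return prev       # after the last stage the final target holds


total_seconds = sum(duration for duration, _ in STAGES)
print(total_seconds)  # 240, i.e. a 4 minute test
print([target_vus(t) for t in (0, 30, 60, 120, 180, 210, 240)])
# [0, 25, 50, 75, 100, 50, 0]
```

The peak of 100 concurrent users is reached at the 3 minute mark, so the latency percentiles in the table below mix lightly and heavily loaded phases of the run.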


| Implementation                                               | avg       | min          | med       | max        | p(90)     | p(95)     | RPS      |
|--------------------------------------------------------------|-----------|--------------|-----------|------------|-----------|-----------|----------|
| [Original code](https://github.com/huggingface/transformers_bloom_parallel) | 8.9s      | 1s           | 9.12s     | 16.69s     | 13.7s     | 14.26s    | 5.9      |
| ISO with original code                                       | 8.88s     | **959.53ms** | 8.89s     | 17.08s     | 13.34s    | 14.12s    | 5.94     |
| New batching logic                                           | **5.44s** | 1.27s        | **5.28s** | **13.12s** | **7.78s** | **8.92s** | **9.08** |

## Install

```shell
make install
```

## Run 

```shell
make run-bloom-560m
```

## Test

```shell
curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
```
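
The same request can be sent from Python. This is a minimal sketch using only the standard library. It mirrors the URL, payload and header of the `curl` call above. The helper names `build_generate_request` and `generate` and the 60 second timeout are assumptions made for illustration. The response body is returned as raw text because its shape is not documented here.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=9,
                           base_url="http://127.0.0.1:3000"):
    """Build the POST request for /generate (same body as the curl example)."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=60, **kwargs):
    """Send the request and return the raw response body as text.

    The timeout is an assumed value; requires a running server (see "Run").
    """
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (needs the server to be up):
# print(generate("Testing API"))
```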

## Develop

```shell
make server-dev
make router-dev
```

## TODO

- [ ] Add tests for the `server/model` logic
- [ ] Backport custom CUDA kernels to Transformers
- [ ] Install safetensors with pip