# What does this PR do?
This forces the use of `bfloat16` for IDEFICS. The issue is that with
`float16` the 80b model produces garbage output. Let me know if this
solution is not appropriate and I'll adjust accordingly. See below for
the details.
The current behaviour:
```sh
$ curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
{"generated_text":""}
```
On closer inspection with:
```python
import requests

# Query the local TGI server and print both the prefill (prompt) tokens
# and the generated tokens, to see exactly what the model returns.
headers = {"Content-Type": "application/json"}
query = "What is Deep Learning?"
data = {
    "inputs": query,
    "parameters": {
        "max_new_tokens": 10,
        "return_full_text": True,
        "decoder_input_details": True,
        "do_sample": False,
    },
}
api_url = "http://127.0.0.1:8080"
response = requests.post(api_url + "/generate", headers=headers, json=data).json()

for section in ["prefill", "tokens"]:
    print(f"### {section}")
    print(repr("".join(t["text"] for t in response["details"][section])))
```
Prints:
```
### prefill
'<s>WhatisDeepLearning?'
### tokens
'<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>'
```
With the change in this PR it prints:
```
### prefill
'<s>WhatisDeepLearning?'
### tokens
'\n\nDeep Learning is a subset of machine'
```
Note that using the Transformers implementation (with
`IdeficsForVisionText2Text.from_pretrained`) also produces the latter
(correct) output.
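For reference, that cross-check against Transformers might look roughly like this. This is a sketch only: the checkpoint name, dtype, and generation arguments are my assumptions rather than part of this PR, and it assumes the IDEFICS processor API from the Transformers release current at the time:

```python
import torch
from transformers import AutoProcessor, IdeficsForVisionText2Text

# Assumed checkpoint; the 80b model needs several GPUs (hence device_map="auto").
checkpoint = "HuggingFaceM4/idefics-80b"
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

# Same text-only prompt as the server queries above.
inputs = processor("What is Deep Learning?", return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```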
This only happens with the 80b model; the 9b model is not as sensitive
to the dtype (as also noted in the code).
The reason for "forcing" this in the IDEFICS init method is that when
quantization is used, the dtype cannot be set explicitly. And since
it's left as `None`, it defaults to `float16`
[here](
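A minimal sketch of what that forcing amounts to, assuming a constructor roughly like the one in `text_generation_server/models/idefics.py`. The class and parameter names here are approximations for illustration, not the exact diff:

```python
import torch

class IDEFICSSharded:
    def __init__(self, model_id, revision=None, quantize=None, dtype=None):
        if torch.cuda.is_available():
            device = torch.device("cuda")
            # The 9b model works well enough in float16, but the 80b model
            # saturates and emits only <unk> tokens. Forcing bfloat16 here
            # also covers the quantized path, where dtype arrives as None
            # and would otherwise fall back to float16.
            dtype = torch.bfloat16
        else:
            device = torch.device("cpu")
            dtype = torch.float32
        self.device, self.dtype = device, dtype
        # ... model loading continues as before ...
```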