Thank you so much for the work you are doing, this is my little
contribution to this great thing you have built. I hope it is useful and
helpful, please don't hesitate to discuss any matters that are not
clear!
I am basing my implementation of frequency penalty on OpenAI's
implementation:
https://platform.openai.com/docs/guides/text-generation/parameter-details
The problem I see with TGI's current implementation is that is not
taking into account the frequency of tokens which have already been
sampled in the current generation stream. Also, the scaling is of the
adjusted token logits is done differently for positive and negative
logits. While in OpenAI's implementation token frequency is taking into
account and the scaling is always done with a subtraction (if penalty is
positive) or add operation (if penalty is negative).
This leads to corrupt generations as I mentioned in issue #1810 .
Moreover, after my tests, other issues are also gone like the one about
some request's with ``penalty_frequency = 1.0`` overruling other
requests (with ``frequency_penalty = 0.0``) in the same batch and
therefore corrupting all generations in the batch. Basically, padding
does not affect this implementation so I believe this ``score *=
input_ids.ne(0)`` is not needed anymore.
Frequency penalty | -1.0 | 0.0 | 1.0
-- | -- | -- | --
Before my change | https://paste.mozilla.org/JxqGJkWY |
https://paste.mozilla.org/hrztJ56h | https://paste.mozilla.org/pBSEH2zw
After my change | https://paste.mozilla.org/7gXCi7zo |
https://paste.mozilla.org/ZR9rJ92g | https://paste.mozilla.org/gHaD2YnC
---------
Co-authored-by: martini <martin.iglesiasgoyanes@adyen.com>
# What does this PR do?
Reworked the loading logic. Idea is to use cleaner loading code:
- Remove need for `no_init_weights`
- Remove all weird `bnb_linear` and `load_weights` and
`post_load_weights`.
New code layout:
- New class `Weights` in charge of handling loading the weights from
multiple files into appropiate tensors (potentially sharded)
- TP layers now are "shells", they contain the code to know what kind of
sharding we need + eventual `all_reduce`. They do not inherit from
linear, but they contain some kind of Linear instead
- the contained linear can be either FastLinear, BnbLinear or GPTq
Linear next.
- All modeling code is explictly made for sharding, process group is
just no-ops for non sharded code (removes a lot of test cases)
![Screenshot from 2023-05-19
23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>