There's currently a discrepancy in tokenization between the router
and the Python server code: the latter includes special tokens but the
former does not.
This results in a token count mismatch for seq2seq models such as mt0
where the tokenizer emits an EOS token at the end.
This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.
As far as I can tell, it is better to include this token in the encoder
`input_ids`, so the simplest fix is to adjust the router side to count
special tokens as well.
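
The mismatch described above can be illustrated with a minimal sketch. It uses a toy whitespace tokenizer rather than the real one, and every name in it (`encode`, `EOS_ID`, `router_length`, and so on) is hypothetical. It only shows how a length counted without special tokens ends up one short of the encoded row, not how the router or server are actually implemented.

```python
# Toy stand-in for a seq2seq tokenizer; all names and ids are hypothetical.
EOS_ID = 1  # id reserved for the end-of-sequence token in this sketch


def encode(text, add_special_tokens=True):
    """Whitespace 'tokenizer' that appends EOS when special tokens are on."""
    vocab_offset = 2  # ids 0 and 1 are reserved in this toy vocabulary
    ids = [vocab_offset + i for i, _ in enumerate(text.split())]
    if add_special_tokens:
        ids.append(EOS_ID)
    return ids


prompt = "translate this sentence"

# Router side before the fix: length counted without special tokens.
router_length = len(encode(prompt, add_special_tokens=False))

# Python server side: the encoded row includes the trailing EOS.
server_ids = encode(prompt, add_special_tokens=True)

# Trusting the router's length for this row silently drops the EOS.
truncated = server_ids[:router_length]

# Router side after the fix: count special tokens too, so lengths agree.
fixed_length = len(encode(prompt, add_special_tokens=True))
```

With the two sides counting the same way, `fixed_length` equals `len(server_ids)` and a length-based slice of the row keeps the EOS token.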
Also in this change:
- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute
Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>