fix(router): Include special tokens when tokenizing (#14)

There's currently a discrepancy in the tokenization between the router
and the python server code: the latter includes special tokens but the
former does not.

This results in a token count mismatch for seq2seq models such as mt0,
whose tokenizer emits an EOS token at the end of the input.
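
To see the off-by-one concretely, here is a minimal sketch using the
`tokenizers` crate; the checkpoint name and input string are illustrative
only, not taken from this change:

```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Example seq2seq checkpoint whose post-processor appends an EOS token.
    let tokenizer = Tokenizer::from_pretrained("bigscience/mt0-small", None)?;

    let text = "Translate to French: Hello";
    // add_special_tokens = false: what the router counted before this fix.
    let without = tokenizer.encode(text, false)?;
    // add_special_tokens = true: what the python server actually tokenizes.
    let with_special = tokenizer.encode(text, true)?;

    // For mt0-style tokenizers the second count is one larger (the EOS).
    println!("{} vs {}", without.len(), with_special.len());
    Ok(())
}
```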

This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.
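
As a toy illustration of the failure mode (this is not the server's actual
batching code, just the shape of the bug): if the reported per-row lengths
are each short by one token, slicing a concatenated batch by those lengths
truncates every row and misaligns all subsequent ones:

```rust
fn main() {
    // Token ids for two rows concatenated into one buffer, EOS id = 1.
    // Row 0 truly has 3 tokens, row 1 has 2.
    let batch: Vec<u32> = vec![10, 11, 1, 20, 1];
    // Lengths counted without special tokens, as the router did before the fix.
    let reported = [2usize, 1];

    let mut offset = 0;
    for (row, &len) in reported.iter().enumerate() {
        // Each slice drops its EOS and the next slice starts one token early.
        println!("row {}: {:?}", row, &batch[offset..offset + len]);
        offset += len;
    }
}
```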

As far as I can tell, it is better to include this token in the encoder
`input_ids`, so it seems best to just make the adjustment on the router side.
commit 3efa5bbbfd (parent 686cc66717)
Author: Nick Hill (committed by GitHub)
Date:   2022-12-30 10:31:44 -08:00
1 changed file with 1 addition and 1 deletion

```diff
@@ -131,7 +131,7 @@ fn validation_worker(
         }
         // Get the number of tokens in the input
-        match tokenizer.encode(request.inputs.clone(), false) {
+        match tokenizer.encode(request.inputs.clone(), true) {
             Ok(inputs) => {
                 let input_length = inputs.len();
```