fix: only use eos_token_id as pad_token_id if int (#2774)

LLama 3 has a list of values as eos_token_id:
  "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']"
This breaks tokenizer since it expects single value. This
commit uses tokenizer.eos_token_id instead in such a case.

Fixes: #2440

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
This commit is contained in:
Dmitry Rogozhkin 2024-12-01 21:26:37 -08:00 committed by GitHub
parent 2c74c55637
commit 535149d872
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
1 changed files with 1 additions and 1 deletions

View File

@ -630,7 +630,7 @@ class CausalLM(Model):
if tokenizer.pad_token_id is None:
if model.config.pad_token_id is not None:
tokenizer.pad_token_id = model.config.pad_token_id
elif model.config.eos_token_id is not None:
elif model.config.eos_token_id is not None and isinstance(model.config.eos_token_id, int):
tokenizer.pad_token_id = model.config.eos_token_id
elif tokenizer.eos_token_id is not None:
tokenizer.pad_token_id = tokenizer.eos_token_id