Hello,
I’m not sure whether this is a bug, a feature, or a known limitation of the tokenizers/transformers libraries that can’t be avoided, but I would like to point out that when the T5 model and its tokenizer are used to prepare inputs for summarization fine-tuning, the encoding scheme is suboptimal.
Concretely, when the tokenizer is set to apply truncation and the input text sequence is longer than `max_length` (i.e. the tokenizer truncates it), the produced labels tensor always contains the EOS token (here `eos_token_id = 1`). This should not happen: the sequence was not ended properly, only truncated, and as a result the model learns to generate the EOS token in cases where it should not (see the snippet below).
If the input sequence is shorter than `max_length`, this problem does not occur, because in that case the labels should indeed contain the EOS token. However, the `decoder_input_ids` tensor generated with `prepare_decoder_input_ids_from_labels()` then contains the EOS token, which also should not be present. This is not a serious problem, though, because the label at that timestep is a pad token, so the loss computation excludes that timestep from consideration.
My question is: shouldn’t this limitation be mentioned in the model’s documentation? In my opinion, it is a tradeoff made for performance reasons: handling the EOS token properly would require “conditional tokenization logic” (i.e. if the sequence is longer than `max_length`, do not append the EOS token), which is currently not supported by the tokenizers library and, I suspect, will not be any time soon. Here is a repro, with a manual workaround sketch after it.
```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

model: T5ForConditionalGeneration = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer: T5TokenizerFast = T5TokenizerFast.from_pretrained("t5-small")

text = "Here is a short article."
text_2 = "Here is a long article, which exceeds model max length."
texts = [text, text_2]

# The second text is longer than max_length=10, so it gets truncated,
# yet its labels still end with the EOS token (id 1).
inputs = tokenizer(texts, padding=True, truncation=True, max_length=10, return_tensors="pt")
labels = inputs["input_ids"]
decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels)

print(labels)
print(decoder_input_ids)
```
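For reference, here is the kind of manual workaround I have in mind; it is a sketch, not an official tokenizers/transformers feature, and `encode_labels` / `max_length` are my own names. The idea: tokenize without truncation, truncate manually, and let the EOS token survive only when the sequence actually fit.

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")

def encode_labels(texts, max_length):
    # Tokenize WITHOUT truncation, so the appended EOS token is always the
    # last token of each sequence.
    encoded = tokenizer(texts, add_special_tokens=True)["input_ids"]
    # Truncating manually to max_length drops the trailing EOS for any
    # sequence that was longer than max_length, and keeps it otherwise.
    truncated = [ids[:max_length] for ids in encoded]
    # Pad to the longest sequence in the batch and return tensors.
    batch = tokenizer.pad({"input_ids": truncated}, padding=True, return_tensors="pt")
    return batch["input_ids"]

labels = encode_labels(
    ["Here is a short article.",
     "Here is a long article, which exceeds model max length."],
    max_length=10,
)
print(labels)  # the second (truncated) row no longer ends with eos_token_id = 1
```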