I’m not sure whether this is a bug, a feature, or a known tokenizers/transformers library limitation that cannot be avoided, but I would like to highlight that when the T5 model and its tokenizer are used to prepare inputs for summarization fine-tuning, the encoding scheme is suboptimal.
Concretely, when the tokenizer is set to apply truncation and the input text sequence is longer than max_length (i.e. the tokenizer truncates it), the produced labels tensor always contains an EOS token (here eos_token_id = 1). This behaviour should not occur: the sequence was not properly ended, merely cut off, so the model also learns to generate the EOS token in cases where it should not.
If the input sequence is shorter than max_length, this problem does not arise, because in that case the labels should indeed contain the EOS token. However, the decoder_input_ids tensor generated by prepare_decoder_input_ids_from_labels() also contains the EOS token, which should not be present either. This is less serious, though, because the loss computation excludes that timestep from consideration: its corresponding label is a pad token.
My question is: shouldn’t the documentation mention that this limitation exists in the model? In my opinion, this is a tradeoff made for performance reasons, because proper handling of the EOS token would require “conditional tokenization logic”: if the sequence is longer than max_length, do not append the EOS token. This is currently not supported by the tokenizers library, and I don’t expect it will be in the near future.
```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

model: T5ForConditionalGeneration = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer: T5TokenizerFast = T5TokenizerFast.from_pretrained("t5-small")

text = "Here is a short article."
text_2 = "Here is a long article, which exceeds model max length."
texts = [text, text_2]

inputs = tokenizer(texts, padding=True, truncation=True, max_length=10, return_tensors="pt")
labels = inputs["input_ids"]
decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels)

print(labels)
print(decoder_input_ids)
```
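Until such conditional tokenization exists, one possible workaround is to post-process the labels: for rows that were truncated, replace the trailing EOS label with -100, which PyTorch’s cross-entropy loss ignores. The sketch below is a hypothetical helper, not a library API; it uses the simple heuristic that a row filling max_length entirely was truncated (which also masks sequences that happen to fit exactly):

```python
EOS_ID = 1     # t5 eos_token_id (as noted in the post)
IGNORE = -100  # label value that CrossEntropyLoss skips

def mask_spurious_eos(labels, max_length):
    """Hypothetical post-processing: in rows that completely fill max_length
    and end with EOS (the heuristic for 'was truncated'), replace the final
    EOS label with IGNORE so the model is not trained to end an unfinished
    sequence. Rows ending in padding or shorter rows are left untouched."""
    out = []
    for row in labels:
        if len(row) == max_length and row[-1] == EOS_ID:
            row = row[:-1] + [IGNORE]
        out.append(row)
    return out

# Truncated row: EOS label masked; padded short row: unchanged.
print(mask_spurious_eos([[5, 6, 7, 8, EOS_ID], [5, 6, EOS_ID, 0, 0]], 5))
```

This does not stop EOS from appearing in the truncated input_ids themselves, but it at least prevents the decoder from being rewarded for emitting EOS after a cut-off sequence.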