T5 tokenizer's post-processor is suboptimal for truncated sequences in seq2seq fine-tuning


I’m not sure whether this is a bug, a feature, or a known limitation of the tokenizers/transformers libraries that cannot be avoided, but I would like to point out that when the T5 model and its tokenizer are used to prepare inputs for summarization fine-tuning, the encoding scheme is suboptimal.

Concretely, when the tokenizer is set to apply truncation and the input text is longer than max_length (i.e. the tokenizer truncates it), the produced labels tensor still ends with the EOS token (here eos_token_id = 1). This should not happen: the sequence was not ended properly, only cut off, so the model incorrectly learns to generate the EOS token even in cases where it should not.

If the input sequence is shorter than max_length, this particular problem does not occur, because in that case the labels should contain the EOS token. However, the decoder_input_ids tensor generated from those labels (via prepare_decoder_input_ids_from_labels()) then also contains the EOS token, which should not be fed to the decoder either. This is less serious, though, because the loss computation excludes that token/timestep from consideration: its corresponding label is the pad token.
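To make the point above concrete, here is a minimal sketch of what the right-shift behind prepare_decoder_input_ids_from_labels() amounts to for T5 (the helper shift_right and the hard-coded example ids are my own; T5 uses decoder_start_token_id = 0, which coincides with its pad token id):

```python
# Sketch of the labels -> decoder_input_ids shift used by T5:
# prepend decoder_start_token_id and drop the last label.
DECODER_START_TOKEN_ID = 0  # for T5 this equals pad_token_id

def shift_right(labels):
    """Shift a single sequence of label ids one step to the right."""
    return [DECODER_START_TOKEN_ID] + labels[:-1]

labels = [31, 52, 1, 0]     # content, content, EOS (id 1), pad (id 0)
print(shift_right(labels))  # [0, 31, 52, 1] -- EOS appears as a decoder input
# The timestep that consumes EOS as input has label 0 (pad); once pads
# in the labels are replaced with -100, cross-entropy ignores that step.
```

This shows why the stray EOS in decoder_input_ids is mostly harmless: the only step that sees it as input is trained against a pad label, which the loss masks out.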

My question is: should this limitation be mentioned in the model's documentation? In my opinion, it is a trade-off made for performance reasons, because handling the EOS token properly would require "conditional tokenization logic" (if the sequence is longer than max_length, do not append the EOS token), which the tokenizers library does not currently support and, I suspect, will not support in the near future.

    from transformers import T5TokenizerFast, T5ForConditionalGeneration

    model: T5ForConditionalGeneration = T5ForConditionalGeneration.from_pretrained("t5-small")
    tokenizer: T5TokenizerFast = T5TokenizerFast.from_pretrained("t5-small")

    text = "Here is a short article."
    text_2 = "Here is a long article, which exceeds model max length."
    texts = [text, text_2]

    # The second sequence is truncated to max_length=10, yet its
    # labels still end with eos_token_id = 1.
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=10, return_tensors="pt")
    labels = inputs["input_ids"]
    # decoder_input_ids are the labels shifted right, so the EOS token
    # also shows up as a decoder input.
    decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels)
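As a workaround, the conditional logic can be done outside the tokenizer: tokenize with add_special_tokens=False, then append EOS only when the sequence actually fits. This is a sketch, not part of transformers; the helper build_labels and the hard-coded ids (eos_token_id = 1, pad_token_id = 0, matching T5's defaults) are my own assumptions:

```python
# Sketch of "conditional tokenization": append EOS only when the
# sequence fits within max_length, i.e. it was not truncated.
EOS_TOKEN_ID = 1  # T5's eos_token_id
PAD_TOKEN_ID = 0  # T5's pad_token_id

def build_labels(token_ids, max_length):
    """token_ids: ids produced with add_special_tokens=False."""
    if len(token_ids) >= max_length:
        # Truncated: keep max_length content tokens and no EOS,
        # since the sequence did not end properly.
        return token_ids[:max_length]
    # Complete sequence: close it with EOS, then pad to max_length.
    ids = token_ids + [EOS_TOKEN_ID]
    ids += [PAD_TOKEN_ID] * (max_length - len(ids))
    return ids

short = [31, 52, 8, 9]         # fits -> gets EOS, then padding
long = list(range(100, 115))   # too long -> truncated, no EOS
print(build_labels(short, 10))  # [31, 52, 8, 9, 1, 0, 0, 0, 0, 0]
print(build_labels(long, 10))   # first 10 ids, no trailing EOS
```

The per-sequence branch is exactly the "conditional tokenization logic" described above, moved into Python where it is cheap to express but slower than the tokenizers fast path.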