What is the loss function for T5?

Could you please explain how the loss function for T5 is computed? It is a seq2seq model: suppose it must map a sequence of X tokens to a sequence of Y tokens, but it generates Z tokens. How are Y and Z compared to calculate the loss?


T5 uses the regular cross-entropy loss (like any language model).

Suppose that you are fine-tuning T5 for translation, and you have the following training example:

* source sentence: "hello how are you"
* target sentence: "salut comment ça-va"

First, one needs to tokenize the sentences for the model using T5Tokenizer. Assuming that every word is tokenized into a single token, and that we also add T5’s special token (namely </s>, which indicates the end of a sequence), we provide the following inputs to the model:

* input tokens = [hello, how, are, you, </s>]
* label tokens = [salut, comment, ça, -, va, </s>]

Of course, we don’t provide these tokens as text to the model, but rather as integer IDs, which refer to row indices in an embedding matrix, so the actual inputs will look like:

* input_ids = [21820, 149, 33, 25, 1]
* labels = [20239, 1670, 3664, 18, 900, 1]
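
For reference, this tokenization can be reproduced with a few lines of code. A minimal sketch, assuming the t5-small checkpoint (the exact sub-token split of "ça-va" may differ slightly from the word-level view above):

```python
from transformers import T5Tokenizer

# assuming the t5-small checkpoint; the tokenizer appends </s> (id 1) automatically
tokenizer = T5Tokenizer.from_pretrained("t5-small")

input_ids = tokenizer("hello how are you", return_tensors="pt").input_ids
labels = tokenizer("salut comment ça-va", return_tensors="pt").input_ids

print(input_ids)  # e.g. tensor([[21820,   149,    33,    25,     1]])
print(labels)     # e.g. tensor([[20239,  1670,  3664,    18,   900,     1]])
```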

In this case, you first provide the input_ids to T5’s encoder, which will turn them into a tensor of shape (batch_size, seq_len, hidden_size). Next, T5’s decoder will predict, for each position of the target sequence, the correct next token. This happens as follows:

      salut         comment      ça          -       va   </s>       => label tokens

      20239          1670        3664        18      900    1        => labels

----------------------------------------------------------------------------------------------                   
                                 DECODER 
----------------------------------------------------------------------------------------------   

         0            20239      1670      3664   18  900  => decoder_input_ids                         

decoder_start_token   salut     comment     ça    -    va  => decoder input tokens

In other words, we prefix the decoder inputs with a special token (the decoder start token, which for T5 is the padding token, with index 0), and then the decoder needs to predict (in parallel) that:

  • the token that follows the decoder start token is “salut”. Here, we compute the cross-entropy loss between the prediction of the model and the target token (which is “salut”).
  • the token that follows “salut” is “comment”. Here, we compute the cross-entropy loss between the prediction of the model and the target token (which is “comment”).
  • the token that follows “comment” is “ça”. Here, we compute the cross-entropy loss between the prediction of the model and the target token (which is “ça”).
  • etc.
  • the token that follows “va” is “</s>” (meaning, the end-of-sequence or EOS token). Here, we compute the cross-entropy loss between the prediction of the model and the target token (which is “</s>”).

In the code, this is done in one go, namely by comparing the logits of the model - which are of shape (batch_size, seq_len, vocab_size) - to the ground truth labels:

loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))
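
For completeness, here is a minimal sketch (assuming the t5-small checkpoint) showing that simply passing labels to T5ForConditionalGeneration makes the model compute this cross-entropy loss for you, and that the same value can be reproduced manually from the logits:

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("hello how are you", return_tensors="pt").input_ids
labels = tokenizer("salut comment ça-va", return_tensors="pt").input_ids

# passing labels makes the model compute the cross-entropy loss internally
outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss)          # scalar cross-entropy loss
print(outputs.logits.shape)  # (batch_size, target_seq_len, vocab_size)

# reproducing the loss manually, exactly as in the snippet above
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=-100)
manual_loss = loss_fct(outputs.logits.view(-1, outputs.logits.size(-1)), labels.view(-1))
print(manual_loss)           # same value as outputs.loss
```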


Thank you very much!

I read that for evaluating sequences in tasks like machine translation, BLEU and ROUGE are used. Can’t these metrics be used here too?

Yes, you can use it here too.

You can leverage HuggingFace Datasets for this.
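
For example, a minimal sketch using the sacrebleu and rouge metrics (in recent versions these metrics live in the separate evaluate library rather than in Datasets itself):

```python
import evaluate

bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")

predictions = ["salut comment ça va"]     # decoded model outputs
references = [["salut comment ça-va"]]    # one or more reference translations per example

print(bleu.compute(predictions=predictions, references=references)["score"])
print(rouge.compute(predictions=predictions, references=["salut comment ça-va"]))
```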

Thanks,
Also, I have another minor question about T5 which could be obvious!
Do the embeddings of words change during supervised fine-tuning of the model? I mean, is the change in word embeddings what the model learns from the examples, or is the learning stored in other parts of the model?

Hi,

Yes, all parameters of the model can be slightly updated when fine-tuning the model. The parameters include the token embeddings, but also the weights of the self-attention layers, the language modeling head, etc.
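
If you want to check (or prevent) this for the token embeddings specifically, here is a small sketch, assuming T5ForConditionalGeneration (where the shared embedding matrix is exposed via get_input_embeddings / model.shared):

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# the token embedding matrix (model.shared in the T5 implementation),
# shared between the encoder and the decoder
embeddings = model.get_input_embeddings()
print(embeddings.weight.shape)          # (vocab_size, d_model)
print(embeddings.weight.requires_grad)  # True -> updated during fine-tuning

# to keep the embeddings frozen while fine-tuning the rest of the model:
embeddings.weight.requires_grad = False
```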

Naive doubt? :slight_smile: For the task of fine-tuning for text summarization, is the same cross-entropy loss used by default by the Hugging Face Seq2SeqTrainer for all T5 versions (small, base, …)?
Thank you

Yes, T5 uses the cross-entropy loss by default for language modeling, as seen here.
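
As a rough sketch (the column names, hyperparameters and dataset below are placeholders), no loss has to be specified anywhere; the trainer simply minimizes the loss returned by the model, which for T5 is the cross-entropy described above:

```python
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def preprocess(examples):
    # "document" and "summary" are placeholder column names for your dataset
    model_inputs = tokenizer(examples["document"], max_length=512, truncation=True)
    labels = tokenizer(examples["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# train_dataset = your_summarization_dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="t5-summarization",
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    # train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
# trainer.train()  # minimizes the model's default cross-entropy loss
```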


Thank you very much.

What if the prediction adds an extra word at the beginning, like:

Foo salut comment ça-va

Does that mean the prediction is just entirely wrong? Is there no partial credit for getting the sentence correct but off by one?

Similarly, often there is more than one correct/reasonable translation, with a slightly different number of words. No partial credit?

If so, it seems like a lot of potentially useful information is being wasted.

Even getting some of the single-word translations right could earn partial credit.

In the context of sequence-to-sequence tasks like translation, where the model generates an entire sequence of tokens, evaluation can indeed be sensitive to the alignment of tokens between the predicted and reference sequences. This issue can arise when there are variations in the length or structure of the generated sequence compared to the reference sequence.

Here are some considerations:

  1. Exact Match Evaluation: If the evaluation metric considers only exact matches, then even a small deviation, such as an extra word at the beginning or a missing word, could result in the entire prediction being marked as incorrect. This approach is strict and does not provide partial credit for partially correct predictions.
  2. Token-Level Metrics: Some evaluation metrics take into account individual tokens and provide partial credit for correct tokens, even if the overall sequence is not an exact match. This is more forgiving and acknowledges that a model may produce a mostly correct translation with some deviations.
  3. BLEU and Similar Metrics: Metrics like BLEU (Bilingual Evaluation Understudy) are commonly used for machine translation. BLEU considers not only exact matches but also partial matches and provides partial credit for partially correct translations. It uses precision at the n-gram level, where n can be 1, 2, 3, etc., capturing partial matches.
  4. Semantic Evaluation: In some cases, semantic similarity metrics are used to evaluate the meaning of the generated text, allowing for more flexibility in word choice and sentence structure.

It’s important to choose or design an evaluation metric that aligns with the specific requirements and expectations of the task. While exact match metrics provide a clear measure of correctness, token-level and semantic metrics can offer a more nuanced assessment of the model’s performance, taking into account variations in length and word choice. The choice of evaluation metric depends on the goals of the task and the desired balance between precision and flexibility.
credit: GPT3.5
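
To make the partial-credit point concrete, here is a small plain-Python toy comparison (not a standard metric implementation) of exact-match scoring versus token-level overlap for the off-by-one prediction discussed above:

```python
from collections import Counter

reference = "salut comment ça-va".split()
prediction = "Foo salut comment ça-va".split()   # extra word at the beginning

# exact match: all-or-nothing, no partial credit
exact_match = float(prediction == reference)
print(exact_match)  # 0.0

# token-level precision: fraction of predicted tokens that appear in the reference
overlap = sum((Counter(prediction) & Counter(reference)).values())
print(overlap / len(prediction))  # 0.75 -> partial credit for the three correct words
```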

In other words, we prefix the decoder inputs with a special token (the decoder start token, which for T5 is the padding token, with index 0), and then the decoder needs to predict (in parallel) that:

Do we have to manually prepend the decoder_start_token? After applying T5Tokenizer I can see the end token’s integer value, i.e. 1, at the end of both input_ids and labels. But I don’t see the integer value 0 (corresponding to the start token) at the start of the tokenized labels. Is it implicitly added during inference/training?

Hi,

No, the model does that automatically, as seen here. The decoder_input_ids are created automatically by the model, based on the labels (by shifting them one position to the right and prepending the decoder_start_token_id).
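
If you want to see what the model builds internally, here is a small sketch (assuming a recent version of transformers, where T5ForConditionalGeneration exposes a prepare_decoder_input_ids_from_labels helper that wraps this shift):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

labels = tokenizer("salut comment ça-va", return_tensors="pt").input_ids
print(labels)
# e.g. tensor([[20239,  1670,  3664,    18,   900,     1]])

# what the model derives from the labels during training:
decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels)
print(decoder_input_ids)
# e.g. tensor([[    0, 20239,  1670,  3664,    18,   900]])
```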


Thank you for the clarification