Seq2Seq Loss computation in Trainer

Hello, I’m using the EncoderDecoderModel for the summarization task.
I have questions about the loss computation in the Trainer class.

For the text summarization task, as far as I know, the encoder input is the document content, while the decoder input and the labels are the summary.

The EncoderDecoderModel uses a causal LM model (e.g. BertLMHeadModel) as the decoder. In the causal LM model, the loss is computed by shifting the logits and labels so that the decoder learns to predict the next token given the decoder inputs.
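For reference, my understanding of that shift is roughly the following (a simplified sketch, not the actual library code):

import torch
from torch.nn import CrossEntropyLoss

def causal_lm_loss(prediction_scores, labels):
    # prediction_scores: (batch, seq_len, vocab_size) logits from the decoder
    # labels:            (batch, seq_len) target token ids
    # Drop the last logit and the first label so that position i of the logits
    # is trained to predict token i+1 of the labels (next-token prediction).
    shifted_logits = prediction_scores[:, :-1, :].contiguous()
    shifted_labels = labels[:, 1:].contiguous()
    loss_fct = CrossEntropyLoss()
    return loss_fct(shifted_logits.view(-1, shifted_logits.size(-1)), shifted_labels.view(-1))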

However, in the Trainer class, the labels are first popped out of the inputs dictionary (transformers/trainer.py at master · huggingface/transformers · GitHub). Without labels, the loss is not calculated inside the decoder model (transformers/modeling_bert.py at master · huggingface/transformers · GitHub). Instead, the loss is calculated in the Trainer at line 1887, and this calculation is different from the one in the decoder model’s forward: there is no shift between the labels and the decoder inputs.

My questions are: how should the decoder inputs and labels be defined in EncoderDecoderModel for the text summarization task? And how can the Trainer be used to fine-tune an EncoderDecoderModel for text summarization?

Thank you.

Note that the labels are only popped if you use label smoothing. The default behavior is indeed that the loss is calculated within the model’s forward.
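For context, the relevant part of Trainer.compute_loss looks roughly like this (a simplified sketch, not the exact library code):

def compute_loss(self, model, inputs, return_outputs=False):
    # Labels are popped only when a label smoother is configured; the loss is
    # then computed outside the model's forward, on the returned logits.
    if self.label_smoother is not None and "labels" in inputs:
        labels = inputs.pop("labels")
    else:
        labels = None
    outputs = model(**inputs)
    if labels is not None:
        loss = self.label_smoother(outputs, labels)
    else:
        # Default path: the model already computed the (shifted) loss in forward.
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
    return (loss, outputs) if return_outputs else loss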

Hi,

@patrickvonplaten has an example of this here: patrickvonplaten/bert2gpt2-cnn_dailymail-fp16 · Hugging Face

Basically, you need to adapt the Trainer a bit in order for it to work with the EncoderDecoderModel framework. However, we are planning to improve the encoder-decoder framework (as we’ve recently also added SpeechEncoderDecoderModel and VisionEncoderDecoderModel) so that it works with the Trainer by default.

Thanks for your reply!
You’re right, the default behavior is exactly what I wanted.
Just one more concern: why is the shift not applied when label smoothing is used?

Thanks in advance.

Thanks for your reply.
If I understand correctly, do I need to install a specific version of Hugging Face Transformers in order to use the slightly modified version of the Trainer?

Hi @GuillaumeZ, I should actually update my previous comment - the example I linked to is outdated.

I suggest reading the following blog post: Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models

It includes a good overview (as well as links to notebooks) on how to fine-tune warm-started encoder-decoder models using the Seq2SeqTrainer (which is an extension of the Trainer).
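As a rough, self-contained sketch of what that looks like (the toy dataset, output directory and preprocessing below are placeholders of my own; see the blog post and notebooks for the real pipeline), fine-tuning a warm-started BERT2BERT model with the Seq2SeqTrainer goes something like this:

from datasets import Dataset
from transformers import (
    BertTokenizerFast,
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Warm-start a BERT2BERT model: encoder and decoder both initialised from bert-base-uncased.
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Tiny toy dataset, only so that the sketch runs end to end.
raw = Dataset.from_dict({
    "document": ["a long article that should be summarized into a single short sentence"],
    "summary": ["a short summary"],
})

def preprocess(batch):
    inputs = tokenizer(batch["document"], padding="max_length", truncation=True, max_length=64)
    outputs = tokenizer(batch["summary"], padding="max_length", truncation=True, max_length=16)
    # Decoder inputs: the tokenized summary itself (the causal LM decoder shifts internally).
    inputs["decoder_input_ids"] = outputs["input_ids"]
    inputs["decoder_attention_mask"] = outputs["attention_mask"]
    # Labels: the same ids, with padding replaced by -100 so it is ignored by the loss.
    inputs["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in seq]
        for seq in outputs["input_ids"]
    ]
    return inputs

train_dataset = raw.map(preprocess, batched=True, remove_columns=["document", "summary"])

training_args = Seq2SeqTrainingArguments(
    output_dir="./bert2bert-summarization",  # placeholder path
    per_device_train_batch_size=1,
    num_train_epochs=1,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)
trainer.train()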

Note that you still need to define the decoder_input_ids yourself when using a decoder like BertLMHeadModel or RobertaLMHeadModel (as in the sketch above). This will be updated in a PR I’m currently working on, such that the decoder_input_ids will be created automatically from the labels provided by the user.
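Concretely, creating the decoder_input_ids from the labels means shifting them one position to the right and prepending the decoder start token, which is the convention BART and T5 follow. A generic sketch (an illustrative helper of my own, not the exact function the PR adds):

import torch

def shift_labels_right(labels, decoder_start_token_id, pad_token_id):
    # decoder_input_ids[:, t] = labels[:, t - 1], with the decoder start token at position 0,
    # so the decoder predicts labels[:, t] from the tokens up to labels[:, t - 1].
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    # -100 is only a loss-masking value; replace it with a real pad token id in the inputs.
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids

labels = torch.tensor([[101, 2023, 2003, 1037, 12654, 102, -100, -100]])  # example ids only
print(shift_labels_right(labels, decoder_start_token_id=101, pad_token_id=0))
# tensor([[101, 101, 2023, 2003, 1037, 12654, 102, 0]])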

Thank you very much, @nielsr. I will read this blog post.

Hi @nielsr

This will be updated in a PR I’m currently working on, such that the decoder_input_ids will be created automatically based on the labels provided by the user.

Is this the current behavior for BART? I find it odd that the output loss is the same whether or not the decoder_input_ids are provided, unless they are being generated internally by the model from the provided labels. However, if that’s the case, this behavior seems undocumented. The docs say:

If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper. (BART)

Minimal code to reproduce the issue:

from transformers import AutoConfig, AutoModelForSeq2SeqLM
import torch
from torch import tensor

config = AutoConfig.from_pretrained(
    'facebook/bart-large',
    cache_dir=None,
    revision='main',
    use_auth_token=None,
)
bart = AutoModelForSeq2SeqLM.from_pretrained(
    'facebook/bart-large',
    from_tf=False,
    config=config,
    cache_dir=None,
    revision='main',
    use_auth_token=None,
)

input_ids = tensor([[0, 510, 1290, 1043, 3019, 6, 1261, 36, 16256, 43, 3]])
attention_mask = torch.ones_like(input_ids)
labels = tensor([[0, 24476,  6302, 16629, 5,  5794, 160, 17223, 1043,  3019, 2]])
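# the labels shifted one position to the right, with BART's decoder start token (</s>, id 2) prepended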
decoder_input_ids = tensor([[2, 0, 24476, 6302, 16629, 5, 5794, 160, 17223, 1043,  3019]])

bart.eval()  # just to make it deterministic

# loss when decoder_input_ids are provided
out1 = bart(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids, labels=labels)
loss1 = out1.loss

# loss when decoder_input_ids are not provided
out2 = bart(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=None, labels=labels)
loss2 = out2.loss

print('Loss with decoder_input_ids   :', loss1.item())
print('Loss without decoder_input_ids:', loss2.item())

Yes, this is to mimic other models such as BART and T5, which also automatically create the decoder_input_ids based on the labels.

That’s actually a mistake in the documentation, it should be “by shifting the labels” instead of “by shifting the input_ids”. Can you open a PR to fix this?

And regarding the losses being the same: that’s actually exactly what we want, right :smiley: we would like to get exactly the same loss when users don’t provide the decoder_input_ids themselves but only the labels. In that case, the model should create exactly the same decoder_input_ids as the ones you provided. Seems like the implementation is correct :slight_smile:
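As a quick check, shifting the labels with BART’s own helper reproduces exactly the decoder_input_ids from your snippet (assuming a recent transformers version where shift_tokens_right takes the pad token id and the decoder start token id):

from torch import tensor
from transformers import AutoConfig
from transformers.models.bart.modeling_bart import shift_tokens_right

config = AutoConfig.from_pretrained('facebook/bart-large')
labels = tensor([[0, 24476, 6302, 16629, 5, 5794, 160, 17223, 1043, 3019, 2]])

# Shift one position to the right and prepend BART's decoder start token (</s>, id 2).
auto_decoder_input_ids = shift_tokens_right(labels, config.pad_token_id, config.decoder_start_token_id)
print(auto_decoder_input_ids)
# expected: tensor([[2, 0, 24476, 6302, 16629, 5, 5794, 160, 17223, 1043, 3019]])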


That’s actually a mistake in the documentation, it should be “by shifting the labels” instead of “by shifting the input_ids”. Can you open a PR to fix this?

Sure, I will :wink:

Seems like the implementation is correct :slight_smile:

Yes, now everything makes sense, thank you!