Seq2Seq Loss computation in Trainer

Hello, I’m using the EncoderDecoderModel for the summarization task.
I have questions about the loss computation in the Trainer class.

For the text summarization task, as far as I know, the encoder input is the document content, while the decoder input and the labels are the summary.

The EncoderDecoderModel uses a causal LM model (e.g. BertLMHeadModel) as the decoder. In the causal LM model, the loss is computed by shifting the logits and labels so that the decoder learns to predict the next token given the decoder inputs.
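For reference, my understanding of that shift is roughly the following (a simplified sketch, not the actual library code):

import torch
from torch.nn import CrossEntropyLoss

def causal_lm_loss(prediction_scores, labels):
    # prediction_scores: (batch, seq_len, vocab_size) logits from the decoder
    # labels:            (batch, seq_len) target token ids
    # Drop the last logit and the first label so that position i of the logits
    # is trained to predict token i+1 of the labels (next-token prediction).
    shifted_logits = prediction_scores[:, :-1, :].contiguous()
    shifted_labels = labels[:, 1:].contiguous()
    loss_fct = CrossEntropyLoss()
    return loss_fct(shifted_logits.view(-1, shifted_logits.size(-1)), shifted_labels.view(-1))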

However, in the Trainer class, the labels are first popped out of the inputs dictionary (transformers/trainer.py at master · huggingface/transformers · GitHub). Without labels, the loss is not calculated inside the decoder model (transformers/modeling_bert.py at master · huggingface/transformers · GitHub). Instead, the loss is calculated in the Trainer at line 1887, and this calculation is different from the one in the decoder model’s forward: there is no shift between the labels and the decoder inputs.

My questions are: how should the decoder inputs and labels be defined in EncoderDecoderModel for the text summarization task? And how can the Trainer be used to fine-tune an EncoderDecoderModel for text summarization?

Thank you.

Note that the labels are only popped if you use label smoothing. The default behavior is indeed that the loss is calculated within the model’s forward.
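For context, the relevant part of Trainer.compute_loss looks roughly like this (a simplified sketch, not the exact library code):

def compute_loss(self, model, inputs, return_outputs=False):
    # Labels are popped only when a label smoother is configured; the loss is
    # then computed outside the model's forward, on the returned logits.
    if self.label_smoother is not None and "labels" in inputs:
        labels = inputs.pop("labels")
    else:
        labels = None
    outputs = model(**inputs)
    if labels is not None:
        loss = self.label_smoother(outputs, labels)
    else:
        # Default path: the model already computed the (shifted) loss in forward.
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
    return (loss, outputs) if return_outputs else loss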

Hi,

@patrickvonplaten has an example of this here: patrickvonplaten/bert2gpt2-cnn_dailymail-fp16 · Hugging Face

Basically, you need to adapt the Trainer a bit in order for it to work with the EncoderDecoderModel framework. However, we are planning to improve the encoder-decoder framework (as we’ve recently also added SpeechEncoderDecoderModel and VisionEncoderDecoderModel) so that it works with the Trainer by default.

Thanks for your reply!
You’re right, the default behavior is exactly what I wanted.
Just one more concern: why is the shift not applied when label smoothing is used?

Thanks in advance.

Thanks for your reply.
If I understand correctly, do I need to install a specific version of Hugging Face Transformers in order to use the slightly modified version of the Trainer?

Hi @GuillaumeZ, I should actually update my previous comment - the example I linked to is outdated.

I suggest reading the following blog post: Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models

It includes a good overview (as well as links to notebooks) on how to fine-tune warm-started encoder-decoder models using the Seq2SeqTrainer (which is an extension of the Trainer).
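As a rough, self-contained sketch of what that looks like (the toy dataset, output directory and preprocessing below are placeholders of my own; see the blog post and notebooks for the real pipeline), fine-tuning a warm-started BERT2BERT model with the Seq2SeqTrainer goes something like this:

from datasets import Dataset
from transformers import (
    BertTokenizerFast,
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Warm-start a BERT2BERT model: encoder and decoder both initialised from bert-base-uncased.
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Tiny toy dataset, only so that the sketch runs end to end.
raw = Dataset.from_dict({
    "document": ["a long article that should be summarized into a single short sentence"],
    "summary": ["a short summary"],
})

def preprocess(batch):
    inputs = tokenizer(batch["document"], padding="max_length", truncation=True, max_length=64)
    outputs = tokenizer(batch["summary"], padding="max_length", truncation=True, max_length=16)
    # Decoder inputs: the tokenized summary itself (the causal LM decoder shifts internally).
    inputs["decoder_input_ids"] = outputs["input_ids"]
    inputs["decoder_attention_mask"] = outputs["attention_mask"]
    # Labels: the same ids, with padding replaced by -100 so it is ignored by the loss.
    inputs["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in seq]
        for seq in outputs["input_ids"]
    ]
    return inputs

train_dataset = raw.map(preprocess, batched=True, remove_columns=["document", "summary"])

training_args = Seq2SeqTrainingArguments(
    output_dir="./bert2bert-summarization",  # placeholder path
    per_device_train_batch_size=1,
    num_train_epochs=1,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)
trainer.train()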

Note that you still need to define the decoder_input_ids yourself when using a decoder like BertLMHeadModel or RobertaLMHeadModel (as in the sketch above). This will be updated in a PR I’m currently working on, such that the decoder_input_ids will be created automatically from the labels provided by the user.
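Concretely, creating the decoder_input_ids from the labels means shifting them one position to the right and prepending the decoder start token, which is the convention BART and T5 follow. A generic sketch (an illustrative helper of my own, not the exact function the PR adds):

import torch

def shift_labels_right(labels, decoder_start_token_id, pad_token_id):
    # decoder_input_ids[:, t] = labels[:, t - 1], with the decoder start token at position 0,
    # so the decoder predicts labels[:, t] from the tokens up to labels[:, t - 1].
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    # -100 is only a loss-masking value; replace it with a real pad token id in the inputs.
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids

labels = torch.tensor([[101, 2023, 2003, 1037, 12654, 102, -100, -100]])  # example ids only
print(shift_labels_right(labels, decoder_start_token_id=101, pad_token_id=0))
# tensor([[101, 101, 2023, 2003, 1037, 12654, 102, 0]])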

Thank you very much, @nielsr. I will read this blog post.

Hi @nielsr

This will be updated in a PR I’m currently working on, such that the decoder_input_ids will be created automatically based on the labels provided by the user.

Is this the current behavior for BART? I find it odd that the output loss is the same whether or not the decoder_input_ids are provided, unless they are being generated internally by the model from the provided labels. However, if that’s the case, this behavior seems undocumented. The docs say:

If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper. (BART)

Minimal code to reproduce the issue:

from transformers import AutoConfig, AutoModelForSeq2SeqLM
import torch
from torch import tensor

config = AutoConfig.from_pretrained(
    'facebook/bart-large',
    cache_dir=None,
    revision='main',
    use_auth_token=None,
)
bart = AutoModelForSeq2SeqLM.from_pretrained(
    'facebook/bart-large',
    from_tf=False,
    config=config,
    cache_dir=None,
    revision='main',
    use_auth_token=None,
)

input_ids = tensor([[0, 510, 1290, 1043, 3019, 6, 1261, 36, 16256, 43, 3]])
attention_mask = torch.ones_like(input_ids)
labels = tensor([[0, 24476,  6302, 16629, 5,  5794, 160, 17223, 1043,  3019, 2]])
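# the labels shifted one position to the right, with BART's decoder start token (</s>, id 2) prepended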
decoder_input_ids = tensor([[2, 0, 24476, 6302, 16629, 5, 5794, 160, 17223, 1043,  3019]])

bart.eval()  # just to make it deterministic

# loss when decoder_input_ids are provided
out1 = bart(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids, labels=labels)
loss1 = out1.loss

# loss when decoder_input_ids are not provided
out2 = bart(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=None, labels=labels)
loss2 = out2.loss

print('Loss with decoder_input_ids   :', loss1.item())
print('Loss without decoder_input_ids:', loss2.item())

Yes, this is to mimic other models such as BART and T5, which also automatically create the decoder_input_ids based on the labels.

That’s actually a mistake in the documentation, it should be “by shifting the labels” instead of “by shifting the input_ids”. Can you open a PR to fix this?

And regarding the losses being the same: that’s actually exactly what we want, right :smiley: we would like to get exactly the same loss when users don’t provide the decoder_input_ids themselves but only the labels. In that case, the model should create exactly the same decoder_input_ids as the ones you provided. Seems like the implementation is correct :slight_smile:
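As a quick check, shifting the labels with BART’s own helper reproduces exactly the decoder_input_ids from your snippet (assuming a recent transformers version where shift_tokens_right takes the pad token id and the decoder start token id):

from torch import tensor
from transformers import AutoConfig
from transformers.models.bart.modeling_bart import shift_tokens_right

config = AutoConfig.from_pretrained('facebook/bart-large')
labels = tensor([[0, 24476, 6302, 16629, 5, 5794, 160, 17223, 1043, 3019, 2]])

# Shift one position to the right and prepend BART's decoder start token (</s>, id 2).
auto_decoder_input_ids = shift_tokens_right(labels, config.pad_token_id, config.decoder_start_token_id)
print(auto_decoder_input_ids)
# expected: tensor([[2, 0, 24476, 6302, 16629, 5, 5794, 160, 17223, 1043, 3019]])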


That’s actually a mistake in the documentation, it should be “by shifting the labels” instead of “by shifting the input_ids”. Can you open a PR to fix this?

Sure, I will :wink:

Seems like the implementation is correct :slight_smile:

Yes, now everything makes sense, thank you!