From Transformers v4.12.0 onwards, the example BERT2BERT Colab is wrong (things to keep in mind when using from transformers import EncoderDecoderModel).

colab : Google Colab

Below is the warning logged during training:

FutureWarning: Version v4.12.0 introduces a better way to train encoder-decoder models by computing the loss inside the encoder-decoder framework rather than in the decoder itself. You may observe training discrepancies if fine-tuning a model trained with versions anterior to 4.12.0. The decoder_input_ids are now created based on the labels, no need to pass them yourself anymore.

git-hub : huggingface

import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
    """
    Shift input ids one token to the right.
    """
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[:, 1:] = input_ids[:, :-1].clone()
    if decoder_start_token_id is None:
        raise ValueError("Make sure to set the decoder_start_token_id attribute of the model's configuration.")
    shifted_input_ids[:, 0] = decoder_start_token_id

    if pad_token_id is None:
        raise ValueError("Make sure to set the pad_token_id attribute of the model's configuration.")
    # replace possible -100 values in labels by `pad_token_id`
    shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)

    return shifted_input_ids

In the Colab, decoder_input_ids are passed in separately. From v4.12.0 onwards, you only need to pass labels, and decoder_input_ids are built from them. The problem is in the labels: if you decode() the tokenized labels, you can confirm that they start with a [CLS] token. The shift_tokens_right() function above then puts decoder_start_token_id, which is also [CLS], in front of them, so decoder_input_ids becomes [CLS][CLS]vocab_tokens[SEP][PAD]…

As a result, [CLS], which should not be in the labels at all, sits at the front of them, and decoder_input_ids contains a duplicate [CLS], so model training becomes a mess. I ran into this myself, and this is what I found while investigating why it happens. The solution is simple: delete the [CLS] token at the front of labels.
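To make the duplication concrete, here is a minimal sketch that mirrors the shifting logic with plain Python lists on a single sequence. It is not the library function itself: shift_right and mask_padding are hypothetical helpers written for this illustration, and the token ids (101 for [CLS], 102 for [SEP], 0 for [PAD]) are assumed toy values.

```python
# Toy token ids, assumed for illustration only.
CLS, SEP, PAD = 101, 102, 0
IGNORE = -100  # value that marks positions excluded from the loss


def shift_right(labels, pad_token_id, decoder_start_token_id):
    """List-based mirror of shift_tokens_right for one sequence (hypothetical helper)."""
    # Prepend the start token and drop the last position.
    shifted = [decoder_start_token_id] + labels[:-1]
    # Replace any remaining -100 values with the pad token id.
    return [pad_token_id if t == IGNORE else t for t in shifted]


def mask_padding(ids, pad_token_id):
    """Replace pad tokens with -100, as the preprocessing does for labels."""
    return [IGNORE if t == pad_token_id else t for t in ids]


# What the tokenizer returns for a target: [CLS] w1 w2 [SEP] [PAD]
tokenized = [CLS, 7, 8, SEP, PAD]

# Labels that still contain the leading [CLS]
labels_with_cls = mask_padding(tokenized, PAD)
decoder_bad = shift_right(labels_with_cls, PAD, CLS)

# Labels with the leading [CLS] removed and one [PAD] appended to keep the length
labels_no_cls = mask_padding(tokenized[1:] + [PAD], PAD)
decoder_ok = shift_right(labels_no_cls, PAD, CLS)

print(decoder_bad)  # [101, 101, 7, 8, 102]
print(decoder_ok)   # [101, 7, 8, 102, 0]
```

With the [CLS] left in the labels, the decoder input starts with two [CLS] tokens. Once it is removed, the decoder input starts with a single [CLS] and the labels begin with the first real token.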
Below is an example of deleting the [CLS] token that exists in front of labels.

def process_data_to_model_inputs(batch):
  # tokenize the inputs and labels
  inputs = tokenizer(batch["long_text"], padding="max_length", truncation=True, max_length=encoder_max_length, return_tensors='pt')
  outputs = tokenizer(batch["summary"], padding="max_length", truncation=True, max_length=decoder_max_length, return_tensors='pt')

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  # decoder_input_ids are no longer passed; since v4.12.0 the model builds them from labels
  # batch["decoder_input_ids"] = outputs.input_ids
  # batch["decoder_attention_mask"] = outputs.attention_mask

  # Build the labels: drop the leading [CLS] and append one [PAD] to keep the length
  output_ids = outputs.input_ids
  labels = output_ids.new_zeros(output_ids.shape)
  labels[:, :-1] = output_ids[:, 1:].clone()   # delete the [CLS] token
  labels[:, -1] = tokenizer.pad_token_id       # append a [PAD] token

  # We have to make sure that the PAD token is ignored by the loss
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in row] for row in labels.tolist()]

  return batch
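
Before training, it can be worth checking that no label row still starts with the decoder start token. The helper below is a hypothetical sketch (rows_starting_with is not part of transformers). It assumes labels are lists of token ids, and that you compare against the id used as decoder_start_token_id, which is the [CLS] id in this setup.

```python
def rows_starting_with(labels, start_token_id):
    """Return the indices of label rows whose first token equals start_token_id."""
    return [
        i for i, row in enumerate(labels)
        if len(row) > 0 and row[0] == start_token_id
    ]


# Toy example with an assumed [CLS] id of 101: row 0 still has the leading [CLS].
example_labels = [
    [101, 7, 8, 102, -100],   # not fixed
    [7, 8, 102, -100, -100],  # fixed
]
offending = rows_starting_with(example_labels, 101)
print(offending)  # [0]
```

An empty result means that no row will produce a duplicated start token in decoder_input_ids.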

With the [CLS] token removed from the labels, the model trained properly. I don't know if this is the right place to post this, but I hope that people who get lost like I did when using EncoderDecoderModel (from transformers import EncoderDecoderModel) will find their way by reading this post.
