From Transformers v4.12.0 onwards, the example colab for BERT2BERT is wrong. (Things to keep in mind when using from transformers import EncoderDecoderModel.)

Colab: Google Colab

Below is the warning logged during training:

FutureWarning: Version v4.12.0 introduces a better way to train encoder-decoder models by computing the loss inside the encoder-decoder framework rather than in the decoder itself. You may observe training discrepancies if fine-tuning a model trained with versions anterior to 4.12.0. The decoder_input_ids are now created based on the labels, no need to pass them yourself anymore.
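
For illustration, here is a minimal sketch of what the warning means in practice (the checkpoint name and input strings are illustrative, not taken from the colab): from v4.12.0 you pass only labels, and EncoderDecoderModel builds decoder_input_ids from them internally via shift_tokens_right() (quoted below) and computes the loss itself.

from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id  # BERT2BERT uses [CLS] as the decoder start token
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("a long input text", return_tensors="pt")
labels = tokenizer("a short summary", return_tensors="pt").input_ids  # note: the tokenizer prepends [CLS]

# no decoder_input_ids passed: they are created from the labels inside the model
outputs = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels)
print(outputs.loss)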

GitHub: huggingface (the shift_tokens_right() implementation is quoted below)


import torch


def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
    """
    Shift input ids one token to the right.
    """
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[:, 1:] = input_ids[:, :-1].clone()
    if decoder_start_token_id is None:
        raise ValueError("Make sure to set the decoder_start_token_id attribute of the model's configuration.")
    shifted_input_ids[:, 0] = decoder_start_token_id

    if pad_token_id is None:
        raise ValueError("Make sure to set the pad_token_id attribute of the model's configuration.")
    # replace possible -100 values in labels by `pad_token_id`
    shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)

    return shifted_input_ids

In the colab, decoder_input_ids are passed in separately. As of v4.12.0, however, you only need to pass labels; decoder_input_ids are no longer required. The problem lies in the labels: if you decode() the tokenized labels, you can confirm that they start with a [CLS] token. The shift_tokens_right() function above then prepends another [CLS] (the decoder_start_token_id), so decoder_input_ids ends up as [CLS][CLS]vocab_tokens[SEP][PAD]…

In other words, a [CLS] token that should not even be in the labels is added at the front, decoder_input_ids contains a duplicate [CLS], and training becomes a mess. I experienced this myself, and the above is what I found while investigating why it happens. The solution is simple: delete the [CLS] token at the front of the labels.
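
Here is a minimal sketch that reproduces the duplication (the checkpoint name and example string are illustrative); it runs a tokenized label through the shift_tokens_right() function quoted above, just as the framework now does:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

labels = tokenizer("a short summary", return_tensors="pt").input_ids
print(tokenizer.decode(labels[0]))
# [CLS] a short summary [SEP]

decoder_input_ids = shift_tokens_right(
    labels,
    pad_token_id=tokenizer.pad_token_id,
    decoder_start_token_id=tokenizer.cls_token_id,  # [CLS] is the decoder start token for a BERT decoder
)
print(tokenizer.decode(decoder_input_ids[0]))
# [CLS] [CLS] a short summary   <- duplicated [CLS]; the trailing [SEP] is shifted out
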
Below is an example that removes the [CLS] token at the front of the labels.


def process_data_to_model_inputs(batch):
  # tokenize the inputs and labels
  inputs = tokenizer(batch["long_text"], padding="max_length", truncation=True, max_length=encoder_max_length, return_tensors='pt')
  outputs = tokenizer(batch["summary"], padding="max_length", truncation=True, max_length=decoder_max_length, return_tensors='pt')

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  # batch["decoder_input_ids"] = outputs.input_ids            # no longer needed from v4.12.0
  # batch["decoder_attention_mask"] = outputs.attention_mask  # decoder inputs are built from labels
  output_ids = outputs.input_ids
  shifted_input_ids = output_ids.new_zeros(output_ids.shape)
  shifted_input_ids[:, :-1] = output_ids[:, 1:].clone()   # shift left by one position to drop the leading [CLS] token
  shifted_input_ids[:, -1] = tokenizer.pad_token_id       # fill the now-empty last position with [PAD]
  batch["labels"] = shifted_input_ids

  # make sure the [PAD] tokens are ignored by the loss (set them to -100)
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"].tolist()]

  return batch
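
For completeness, this is how such a preprocessing function is typically applied with datasets.map; the dataset variable and batch size below are assumed placeholders, while the column names match the function above.

train_data = train_data.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=16,  # placeholder batch size
    remove_columns=["long_text", "summary"],
)
train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"]
)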

I did this and the model trained properly. I don’t know whether it’s right to post this here, but I hope that people who get lost like I did when using “from transformers import EncoderDecoderModel” will find their way by reading this post.
