Warm-started encoder-decoder models (Bert2Gpt2 and Bert2Bert)

I am working on warm-starting models for the summarization task, based on @patrickvonplaten's great blog: Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models. However, I have a few questions regarding these models, especially the Bert2Gpt2 and Bert2Bert models:

1- As we all know, the summarization task requires a sequence-to-sequence model. In @patrickvonplaten's blog on warm-starting a Bert2Gpt2 model:

Why do we use Trainer and TrainingArguments instead of Seq2SeqTrainer and Seq2SeqTrainingArguments?

2- For the Bert2Gpt2 model, how can the decoder (GPT2) understand the output of the encoder (BERT) when they use different vocabularies?

3- For the Bert2Bert and Roberta2Roberta models, how can encoder-only models like BERT and RoBERTa be used as decoders?

Best Regards :slight_smile:

Hi,

Why do we use Trainer and TrainingArguments instead of Seq2SeqTrainer and Seq2SeqTrainingArguments?

That blog post is outdated, and we plan to make a new one that leverages the Seq2SeqTrainer.

It is possible to use the Seq2SeqTrainer for training EncoderDecoder models, as seen in my notebook here. Note that in that notebook, I’m training a VisionEncoderDecoderModel, but it’s similar to EncoderDecoderModel (just combining a vision encoder with a text decoder instead of combining a text encoder with a text decoder).
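For the text-to-text case, a minimal sketch of that setup could look roughly like the following (the hyperparameters are placeholders, and train_dataset/eval_dataset are assumed to be already tokenized datasets, as in the notebook):

from transformers import EncoderDecoderModel, Seq2SeqTrainer, Seq2SeqTrainingArguments

# warm-start BERT as encoder and GPT2 as decoder
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")

# generation needs to know where decoding starts and how padding is handled
model.config.decoder_start_token_id = model.config.decoder.bos_token_id
model.config.pad_token_id = model.config.encoder.pad_token_id

training_args = Seq2SeqTrainingArguments(
    output_dir="bert2gpt2-summarization",  # placeholder
    predict_with_generate=True,            # generate sequences during evaluation (e.g. for ROUGE)
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()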

For the Bert2Gpt2 model, how can the decoder (GPT2) understand the output of the encoder (BERT) when they use different vocabularies?

The models don't communicate via words; they communicate via tensors. During the cross-attention operation, the decoder exposes queries, while the encoder exposes keys and values. One just needs to make sure they both have the same number of channels (hidden_size) so that the dot products between vectors are possible.
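As a quick illustration of that constraint (just a sanity check, using the standard bert-base-uncased and gpt2 checkpoints), you can compare the hidden sizes of the two configurations before combining them:

from transformers import AutoConfig

encoder_config = AutoConfig.from_pretrained("bert-base-uncased")
decoder_config = AutoConfig.from_pretrained("gpt2")

# cross-attention computes dot products between decoder queries and encoder
# keys/values, so the hidden dimensions have to line up
print(encoder_config.hidden_size)  # 768
print(decoder_config.n_embd)       # 768 (GPT2 names its hidden size n_embd)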

For the Bert2Bert and Roberta2Roberta models, how can encoder-only models like BERT and RoBERTa be used as decoders?

That's a good question. A BERT model is an encoder-only model, but under the hood it's just a stack of self-attention layers (with fully-connected networks in between). A decoder is also just a stack of self-attention layers (with fully-connected networks in between); the main differences are that a decoder uses a causal self-attention mask and additionally has cross-attention layers.

So you can actually initialize the weights of a decoder with the weights of an encoder-only model (meaning the weights of all self-attention layers and fully-connected networks). However, the weights of the cross-attention layers will be randomly initialized. Hence, one needs to fine-tune a Bert2Bert model on a downstream dataset (e.g. translation or summarization) in order for these cross-attention weights to be trained.
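For concreteness, here is a minimal sketch of such a warm start, using the standard bert-base-uncased checkpoint for both sides (not necessarily the exact checkpoints from the blog):

from transformers import EncoderDecoderModel

# both encoder and decoder are initialized from an encoder-only BERT checkpoint;
# the decoder copy is loaded with is_decoder=True and add_cross_attention=True,
# so its cross-attention weights cannot come from the checkpoint and are
# randomly initialized (Transformers prints a warning listing them)
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)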


Thank you @nielsr for your clarification, it's really clear. I read your notebook (Fine-tune TrOCR on the IAM Handwriting Database) and, trying to understand what the TrOCRProcessor is, found that “it wraps ViTFeatureExtractor and RobertaTokenizer into a single instance to both extract the input features and decode the predicted token ids”. Then I noticed that you pass “processor.feature_extractor” as the “tokenizer” argument of the Seq2SeqTrainer, as follows:

trainer = Seq2SeqTrainer(
model=model,
tokenizer=processor.feature_extractor,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=default_data_collator,
)

That leads me to one more question:
In my case, a Bert2GPT2 model for the summarization task, what should I put in the tokenizer argument of the Seq2SeqTrainer instead of “processor.feature_extractor”: the encoder tokenizer (BERT tokenizer) or the decoder tokenizer (GPT2 tokenizer)? Note that in the outdated blog (patrickvonplaten/bert2gpt2-cnn_dailymail-fp16 · Hugging Face) this argument (tokenizer) was omitted.

Thanks again

I believe the tokenizer argument is only required in case you haven’t batched the data yourself:

If provided, will be used to automatically pad the inputs to the maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.

I do have a notebook where I’ve updated the outdated blog post of @patrickvonplaten: Google Colab.

I believe you just need to update the process_data_to_model_inputs function: it should use the BERT tokenizer to prepare the inputs and the GPT2 tokenizer to prepare the labels. No tokenizer needs to be passed to the Seq2SeqTrainer.
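A rough sketch of what that adapted function could look like (the column names, max lengths and the use of the fast tokenizers are my assumptions based on the CNN/DailyMail setup, not the exact notebook code):

from transformers import BertTokenizerFast, GPT2TokenizerFast

encoder_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
decoder_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
decoder_tokenizer.pad_token = decoder_tokenizer.eos_token  # GPT2 has no padding token by default

def process_data_to_model_inputs(batch):
    # encoder side: tokenize the articles with the BERT tokenizer
    inputs = encoder_tokenizer(
        batch["article"], padding="max_length", truncation=True, max_length=512
    )
    # decoder side: tokenize the summaries with the GPT2 tokenizer
    outputs = decoder_tokenizer(
        batch["highlights"], padding="max_length", truncation=True, max_length=128
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    # padded label positions are set to -100 so they are ignored by the loss
    batch["labels"] = [
        [-100 if token == decoder_tokenizer.pad_token_id else token for token in labels]
        for labels in outputs.input_ids
    ]
    return batch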


Thank you @nielsr for your explanations and suggestions, I really appreciate it.

One more question, please.

I pushed the model to the hub:
Ayham/roberta_gpt2_summarization_cnn_dailymail

It gives great results … I really appreciate your assistance.

But when I try to use the model’s Inference API, it gives me the following message:

Can’t load tokenizer using from_pretrained, please update its configuration: Can’t load tokenizer for ‘Ayham/roberta_gpt2_summarization_cnn_dailymail’. Make sure that:
- ‘Ayham/roberta_gpt2_summarization_cnn_dailymail’ is a correct model identifier listed on ‘Models - Hugging Face’ (make sure ‘Ayham/roberta_gpt2_summarization_cnn_dailymail’ is not a path to a local directory with something else, in that case)
- or ‘Ayham/roberta_gpt2_summarization_cnn_dailymail’ is the correct path to a directory containing relevant tokenizer files

Why?!

Can I save the tokenizer files explicitly? If yes, which one should I save: the encoder tokenizer (RoBERTa) or the decoder one (GPT2)?

If you have another way to get the Inference API to return results, please let me know.

Thank you in advance :slight_smile:

Hi,

Looking at the files (Ayham/roberta_gpt2_summarization_cnn_dailymail at main), it indeed looks like only the weights (pytorch_model.bin) and the model configuration (config.json) were uploaded, but not the tokenizer files.

You can upload the tokenizer files programmatically using the huggingface_hub library. First, make sure you have installed Git LFS and are logged in to your Hugging Face account. In Colab, this can be done as follows:

!sudo apt-get install git-lfs
!git config --global user.email "your email"
!git config --global user.name "your username"
!huggingface-cli login

Next, you can do the following:

from transformers import RobertaTokenizer
from huggingface_hub import Repository

repo_url = "https://huggingface.co/Ayham/roberta_gpt2_summarization_cnn_dailymail"

# clone the existing model repo locally
repo = Repository(local_dir="tokenizer_files", # note that this directory must not exist already
                  clone_from=repo_url,
                  git_user="Niels Rogge",
                  git_email="niels.rogge1@gmail.com",
                  use_auth_token=True,
)

# save the (encoder) tokenizer files into the cloned repo
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
tokenizer.save_pretrained("tokenizer_files")

# commit and push the new files to the hub
repo.push_to_hub(commit_message="Upload tokenizer files")
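Once those files are on the hub, a quick way to check that the Inference API will find them is to load the tokenizer back from the repo:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ayham/roberta_gpt2_summarization_cnn_dailymail")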

Note that the Trainer can actually push all files to the hub for you automatically during/after training, as seen here.
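If you prefer that route, the relevant pieces look roughly like this (a sketch only; model, tokenizer and the datasets are assumed to be defined as before, and the output_dir doubles as the hub repo name):

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="roberta_gpt2_summarization_cnn_dailymail",
    push_to_hub=True,            # push checkpoints and config to the hub during training
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,         # if passed, the tokenizer files are saved and pushed as well
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
trainer.push_to_hub()            # final push of the trained model after training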


Thank you @nielsr for your useful explanations and suggestions, I really appreciate it.