Warm-started encoder-decoder models (Bert2Gpt2 and Bert2Bert)

I am working on warm-starting models for the summarization task, based on @patrickvonplaten's great blog: Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models. However, I have a few questions regarding these models, especially the Bert2Gpt2 and Bert2Bert models:

1- As we all know, the summarization task requires a sequence-to-sequence model. In @patrickvonplaten's blog on warm-starting a Bert2Gpt2 model:

Why do we use Trainer and TrainingArguments instead of Seq2SeqTrainer and Seq2SeqTrainingArguments?

2- For the Bert2Gpt2 model, how can the decoder (GPT2) understand the output of the encoder (BERT) when they use different vocabularies?

3- For the Bert2Bert and Roberta2Roberta models, how can encoder-only models like BERT and RoBERTa be used as decoders?

Best Regards :slight_smile:

Hi,

Why do we use Trainer and TrainingArguments instead of Seq2SeqTrainer and Seq2SeqTrainingArguments?

That blog post is outdated, and we plan to make a new one that leverages the Seq2SeqTrainer.

It is possible to use the Seq2SeqTrainer for training EncoderDecoder models, as seen in my notebook here. Note that in that notebook, I’m training a VisionEncoderDecoderModel, but it’s similar to EncoderDecoderModel (just combining a vision encoder with a text decoder instead of combining a text encoder with a text decoder).
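For the text-to-text case, a minimal sketch of that setup could look roughly like the following (the hyperparameters are placeholders, and train_dataset/eval_dataset are assumed to be already tokenized datasets, as in the notebook):

from transformers import EncoderDecoderModel, Seq2SeqTrainer, Seq2SeqTrainingArguments

# warm-start BERT as encoder and GPT2 as decoder
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")

# generation needs to know where decoding starts and how padding is handled
model.config.decoder_start_token_id = model.config.decoder.bos_token_id
model.config.pad_token_id = model.config.encoder.pad_token_id

training_args = Seq2SeqTrainingArguments(
    output_dir="bert2gpt2-summarization",  # placeholder
    predict_with_generate=True,            # generate sequences during evaluation (e.g. for ROUGE)
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()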

For the Bert2Gpt2 model, how can the decoder (GPT2) understand the output of the encoder (BERT) when they use different vocabularies?

The models don't communicate via words; they communicate via tensors. During the cross-attention operation, the decoder exposes queries, while the encoder exposes keys and values. One just needs to make sure they both have the same number of channels (hidden_size) so that the dot products between vectors are possible.
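As a quick illustration of that constraint (just a sanity check, using the standard bert-base-uncased and gpt2 checkpoints), you can compare the hidden sizes of the two configurations before combining them:

from transformers import AutoConfig

encoder_config = AutoConfig.from_pretrained("bert-base-uncased")
decoder_config = AutoConfig.from_pretrained("gpt2")

# cross-attention computes dot products between decoder queries and encoder
# keys/values, so the hidden dimensions have to line up
print(encoder_config.hidden_size)  # 768
print(decoder_config.n_embd)       # 768 (GPT2 names its hidden size n_embd)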

For the Bert2Bert and Roberta2Roberta models, how can encoder-only models like BERT and RoBERTa be used as decoders?

That's a good question. A BERT model is an encoder-only model, but under the hood it's just a stack of self-attention layers (with fully-connected networks in between). A decoder is also just a stack of self-attention layers (with fully-connected networks in between); the main differences are that a decoder uses a causal self-attention mask and additionally has cross-attention layers.

So you can actually initialize the weights of a decoder with the weights of an encoder-only model (meaning the weights of all self-attention layers and fully-connected networks). However, the weights of the cross-attention layers will be randomly initialized. Hence, one needs to fine-tune a Bert2Bert model on a downstream dataset (e.g. translation or summarization) in order for these cross-attention weights to be trained.
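For concreteness, here is a minimal sketch of such a warm start, using the standard bert-base-uncased checkpoint for both sides (not necessarily the exact checkpoints from the blog):

from transformers import EncoderDecoderModel

# both encoder and decoder are initialized from an encoder-only BERT checkpoint;
# the decoder copy is loaded with is_decoder=True and add_cross_attention=True,
# so its cross-attention weights cannot come from the checkpoint and are
# randomly initialized (Transformers prints a warning listing them)
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)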


Thank you @nielsr for your clarification, it's really clear. I read your notebook (Fine-tune TrOCR on the IAM Handwriting Database) and, trying to understand what the TrOCRProcessor is, found that “it wraps ViTFeatureExtractor and RobertaTokenizer into a single instance to both extract the input features and decode the predicted token ids”. Then I noticed that you pass “processor.feature_extractor” as the “tokenizer” argument of the Seq2SeqTrainer, as follows:

trainer = Seq2SeqTrainer(
model=model,
tokenizer=processor.feature_extractor,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=default_data_collator,
)

That leads me to one more question:
In my case, a Bert2GPT2 model for the summarization task, what should I put in the tokenizer argument of the Seq2SeqTrainer instead of “processor.feature_extractor”: the encoder tokenizer (BERT tokenizer) or the decoder tokenizer (GPT2 tokenizer)? Note that in the outdated blog (patrickvonplaten/bert2gpt2-cnn_dailymail-fp16 · Hugging Face) this argument (tokenizer) was omitted.

Thanks again

I believe the tokenizer argument is only required in case you haven’t batched the data yourself:

If provided, will be used to automatically pad the inputs to the maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.

I do have a notebook where I’ve updated the outdated blog post of @patrickvonplaten: Google Colab.

I believe you just need to update the process_data_to_model_inputs function: it should use the BERT tokenizer to prepare the inputs and the GPT2 tokenizer to prepare the labels. No tokenizer needs to be passed to the Seq2SeqTrainer.
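A rough sketch of what that adapted function could look like (the column names, max lengths and the use of the fast tokenizers are my assumptions based on the CNN/DailyMail setup, not the exact notebook code):

from transformers import BertTokenizerFast, GPT2TokenizerFast

encoder_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
decoder_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
decoder_tokenizer.pad_token = decoder_tokenizer.eos_token  # GPT2 has no padding token by default

def process_data_to_model_inputs(batch):
    # encoder side: tokenize the articles with the BERT tokenizer
    inputs = encoder_tokenizer(
        batch["article"], padding="max_length", truncation=True, max_length=512
    )
    # decoder side: tokenize the summaries with the GPT2 tokenizer
    outputs = decoder_tokenizer(
        batch["highlights"], padding="max_length", truncation=True, max_length=128
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    # padded label positions are set to -100 so they are ignored by the loss
    batch["labels"] = [
        [-100 if token == decoder_tokenizer.pad_token_id else token for token in labels]
        for labels in outputs.input_ids
    ]
    return batch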


Thank you @nielsr for your explanations and suggestions, I really appreciate it.

One more question, please.

I pushed the model to the hub:
Ayham/roberta_gpt2_summarization_cnn_dailymail

It gives great results … I really appreciate your assistance.

But when I try to use the model’s Inference API, it gives me the following message:

Can’t load tokenizer using from_pretrained, please update its configuration: Can’t load tokenizer for ‘Ayham/roberta_gpt2_summarization_cnn_dailymail’. Make sure that:
- ‘Ayham/roberta_gpt2_summarization_cnn_dailymail’ is a correct model identifier listed on ‘Models - Hugging Face’ (make sure ‘Ayham/roberta_gpt2_summarization_cnn_dailymail’ is not a path to a local directory with something else, in that case)
- or ‘Ayham/roberta_gpt2_summarization_cnn_dailymail’ is the correct path to a directory containing relevant tokenizer files

Why?!

Can I save the tokenizer files explicitly? If yes, which one should I save: the encoder tokenizer (RoBERTa) or the decoder one (GPT2)?

If you have another way to get the Inference API to return results, please let me know.

Thank you in advance :slight_smile:

Hi,

Looking at the files (Ayham/roberta_gpt2_summarization_cnn_dailymail at main), it indeed looks like only the weights (pytorch_model.bin) and the model configuration (config.json) were uploaded, but not the tokenizer files.

You can upload the tokenizer files programmatically using the huggingface_hub library. First, make sure you have installed Git LFS and are logged in to your Hugging Face account. In Colab, this can be done as follows:

!sudo apt-get install git-lfs
!git config --global user.email "your email"
!git config --global user.name "your username"
!huggingface-cli login

Next, you can do the following:

from transformers import RobertaTokenizer
from huggingface_hub import Repository

repo_url = "https://huggingface.co/Ayham/roberta_gpt2_summarization_cnn_dailymail"

# clone the existing model repo locally
repo = Repository(local_dir="tokenizer_files", # note that this directory must not exist already
                  clone_from=repo_url,
                  git_user="Niels Rogge",
                  git_email="niels.rogge1@gmail.com",
                  use_auth_token=True,
)

# save the (encoder) tokenizer files into the cloned repo
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
tokenizer.save_pretrained("tokenizer_files")

# commit and push the new files to the hub
repo.push_to_hub(commit_message="Upload tokenizer files")
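Once those files are on the hub, a quick way to check that the Inference API will find them is to load the tokenizer back from the repo:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ayham/roberta_gpt2_summarization_cnn_dailymail")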

Note that the Trainer can actually push all files to the hub for you automatically during/after training, as seen here.
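If you prefer that route, the relevant pieces look roughly like this (a sketch only; model, tokenizer and the datasets are assumed to be defined as before, and the output_dir doubles as the hub repo name):

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="roberta_gpt2_summarization_cnn_dailymail",
    push_to_hub=True,            # push checkpoints and config to the hub during training
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,         # if passed, the tokenizer files are saved and pushed as well
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
trainer.push_to_hub()            # final push of the trained model after training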


Thank you @nielsr for your useful explanations and suggestions, I really appreciate it.