Hi,
Why don’t we use Seq2SeqTrainer and Seq2SeqTrainingArguments, instead of Trainer and TrainingArguments?
That blog post is outdated, and we plan to make a new one that leverages the Seq2SeqTrainer.
It is possible to use the Seq2SeqTrainer for training EncoderDecoder models, as seen in my notebook here. Note that in that notebook, I’m training a VisionEncoderDecoderModel, but it’s similar to an EncoderDecoderModel (it just combines a vision encoder with a text decoder instead of a text encoder with a text decoder).
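For reference, here’s a minimal sketch (not the exact code from the notebook) of warm-starting a Bert2Bert model and training it with the Seq2SeqTrainer. The tiny in-memory dataset, the output directory and the hyperparameters are just placeholders:

```python
from datasets import Dataset
from transformers import (
    BertTokenizerFast,
    DataCollatorForSeq2Seq,
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# The encoder-decoder wrapper needs these set explicitly for training/generation
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Tiny dummy summarization dataset, just to keep the sketch self-contained
raw = Dataset.from_dict({
    "document": ["The cat sat on the mat all day long.", "Transformers are neural networks."],
    "summary": ["Cat sat on mat.", "Transformers are networks."],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["document"], truncation=True, max_length=128)
    labels = tokenizer(batch["summary"], truncation=True, max_length=32)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="bert2bert-demo",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    predict_with_generate=True,  # use generate() during evaluation
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```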
For a Bert2Gpt2 model, how can the decoder (GPT-2) understand the output of the encoder (BERT) when they use different vocabularies?
The models don’t communicate via words, they communicate via tensors. During the cross-attention operation, the decoder exposes the queries, while the encoder exposes the keys and values. One just needs to make sure they both have the same number of channels (hidden_size), so that the dot products between query and key vectors are possible.
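To make that concrete, here’s a toy illustration in plain PyTorch (not the actual BERT/GPT-2 code; the projection matrices are made up) showing that cross-attention only needs the encoder and decoder hidden states to share the same channel dimension:

```python
import torch

hidden_size = 768          # must match between encoder output and decoder hidden states
src_len, tgt_len = 12, 7   # arbitrary source/target sequence lengths

encoder_hidden = torch.randn(1, src_len, hidden_size)  # what the encoder outputs (tensors, not words)
decoder_hidden = torch.randn(1, tgt_len, hidden_size)  # the decoder's current hidden states

# Stand-ins for the cross-attention layer's projection weights
w_q = torch.randn(hidden_size, hidden_size)
w_k = torch.randn(hidden_size, hidden_size)
w_v = torch.randn(hidden_size, hidden_size)

queries = decoder_hidden @ w_q   # (1, tgt_len, hidden_size) -- from the decoder
keys    = encoder_hidden @ w_k   # (1, src_len, hidden_size) -- from the encoder
values  = encoder_hidden @ w_v   # (1, src_len, hidden_size) -- from the encoder

# Dot products work because queries and keys share the same channel dimension
scores  = queries @ keys.transpose(-1, -2) / hidden_size ** 0.5   # (1, tgt_len, src_len)
context = torch.softmax(scores, dim=-1) @ values                  # (1, tgt_len, hidden_size)
print(context.shape)
```

In the Bert2Gpt2 case this condition happens to be satisfied out of the box, since bert-base-uncased and gpt2 both use a hidden size of 768.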
For Bert2Bert and Roberta2Roberta models, how can BERT and RoBERTa be used as decoders when they are encoder-only models?
That’s a good question. A BERT model is an encoder-only model, but it’s really just a stack of self-attention layers (with fully-connected networks in between). A decoder is also just a stack of self-attention layers (with fully-connected networks in between); the only differences are that its self-attention uses a causal mask (which adds no weights) and that it also has cross-attention layers.
So you can actually initialize the weights of a decoder with the weights of an encoder-only model (meaning initializing the weights of all self-attention layers and fully-connected networks). However, the weights of the cross-attention layers will be randomly initialized. Hence, one needs to fine-tune a Bert2Bert model on a downstream task (like translation or summarization) in order for these cross-attention weights to be trained.
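As a small sanity check (a sketch; the module path below reflects how the BERT decoder is laid out in recent transformers versions), you can warm-start a Bert2Bert model and inspect the freshly added cross-attention module, which has no counterpart in the original BERT checkpoint. transformers will also print a warning that those weights are newly initialized:

```python
from transformers import EncoderDecoderModel

# Both encoder and decoder are initialized from bert-base-uncased;
# only the cross-attention weights are randomly initialized.
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Each decoder layer now has a `crossattention` module next to its self-attention
print(bert2bert.decoder.bert.encoder.layer[0].crossattention)
```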