Hi,
Why don’t we use Seq2SeqTrainer and Seq2SeqTrainingArguments, instead of Trainer and TrainingArguments?
That blog post is outdated, and we plan to make a new one that leverages the Seq2SeqTrainer.
It is possible to use the Seq2SeqTrainer for training EncoderDecoder models, as seen in my notebook here. Note that in that notebook, I’m training a VisionEncoderDecoderModel, but it’s similar to an EncoderDecoderModel (it just combines a vision encoder with a text decoder instead of a text encoder with a text decoder).
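For reference, here’s a minimal sketch (not the exact code from the notebook) of warm-starting a Bert2Bert model and training it with the Seq2SeqTrainer. The tiny in-memory dataset, the output directory and the hyperparameters are just placeholders:

```python
from datasets import Dataset
from transformers import (
    BertTokenizerFast,
    DataCollatorForSeq2Seq,
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# The encoder-decoder wrapper needs these set explicitly for training/generation
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Tiny dummy summarization dataset, just to keep the sketch self-contained
raw = Dataset.from_dict({
    "document": ["The cat sat on the mat all day long.", "Transformers are neural networks."],
    "summary": ["Cat sat on mat.", "Transformers are networks."],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["document"], truncation=True, max_length=128)
    labels = tokenizer(batch["summary"], truncation=True, max_length=32)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="bert2bert-demo",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    predict_with_generate=True,  # use generate() during evaluation
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```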
For a Bert2Gpt2 model, how can the decoder (GPT-2) understand the output of the encoder (BERT) when they use different vocabularies?
The models don’t communicate via words, they communicate via tensors. During the cross-attention operation, the decoder exposes the queries, while the encoder exposes the keys and values. One just needs to make sure they both have the same number of channels (hidden_size), so that the dot products between query and key vectors are possible.
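To make that concrete, here’s a toy illustration in plain PyTorch (not the actual BERT/GPT-2 code; the projection matrices are made up) showing that cross-attention only needs the encoder and decoder hidden states to share the same channel dimension:

```python
import torch

hidden_size = 768          # must match between encoder output and decoder hidden states
src_len, tgt_len = 12, 7   # arbitrary source/target sequence lengths

encoder_hidden = torch.randn(1, src_len, hidden_size)  # what the encoder outputs (tensors, not words)
decoder_hidden = torch.randn(1, tgt_len, hidden_size)  # the decoder's current hidden states

# Stand-ins for the cross-attention layer's projection weights
w_q = torch.randn(hidden_size, hidden_size)
w_k = torch.randn(hidden_size, hidden_size)
w_v = torch.randn(hidden_size, hidden_size)

queries = decoder_hidden @ w_q   # (1, tgt_len, hidden_size) -- from the decoder
keys    = encoder_hidden @ w_k   # (1, src_len, hidden_size) -- from the encoder
values  = encoder_hidden @ w_v   # (1, src_len, hidden_size) -- from the encoder

# Dot products work because queries and keys share the same channel dimension
scores  = queries @ keys.transpose(-1, -2) / hidden_size ** 0.5   # (1, tgt_len, src_len)
context = torch.softmax(scores, dim=-1) @ values                  # (1, tgt_len, hidden_size)
print(context.shape)
```

In the Bert2Gpt2 case this condition happens to be satisfied out of the box, since bert-base-uncased and gpt2 both use a hidden size of 768.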
For Bert2Bert and Roberta2Roberta models, how can BERT and RoBERTa be used as decoders when they are encoder-only models?
That’s a good question. A BERT model is an encoder-only model, but it’s really just a stack of self-attention layers (with fully-connected networks in between). A decoder is also just a stack of self-attention layers (with fully-connected networks in between); the only differences are that its self-attention uses a causal mask (which adds no weights) and that it also has cross-attention layers.
So you can actually initialize the weights of a decoder with the weights of an encoder-only model (meaning initializing the weights of all self-attention layers and fully-connected networks). However, the weights of the cross-attention layers will be randomly initialized. Hence, one needs to fine-tune a Bert2Bert model on a downstream task (like translation or summarization) in order for these cross-attention weights to be trained.
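As a small sanity check (a sketch; the module path below reflects how the BERT decoder is laid out in recent transformers versions), you can warm-start a Bert2Bert model and inspect the freshly added cross-attention module, which has no counterpart in the original BERT checkpoint. transformers will also print a warning that those weights are newly initialized:

```python
from transformers import EncoderDecoderModel

# Both encoder and decoder are initialized from bert-base-uncased;
# only the cross-attention weights are randomly initialized.
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Each decoder layer now has a `crossattention` module next to its self-attention
print(bert2bert.decoder.bert.encoder.layer[0].crossattention)
```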