How to train an EncoderDecoderModel with different pretrained encoder and decoder?

When initializing an EncoderDecoderModel from two different pre-trained checkpoints, the following runs without error:

import evaluate
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import load_dataset, Dataset

import torch
from transformers import EncoderDecoderModel


src_tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
tgt_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

multibert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "dbmdz/bert-base-german-uncased", "bert-base-uncased",
)

When we check the two tokenizers, they are not aligned, especially the special tokens: the same special tokens map to different ids. For example:

src_tokenizer.bos_token = src_tokenizer.cls_token
src_tokenizer.eos_token = src_tokenizer.sep_token
src_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

tgt_tokenizer.bos_token = tgt_tokenizer.cls_token
tgt_tokenizer.eos_token = tgt_tokenizer.sep_token
tgt_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

print(tgt_tokenizer.special_tokens_map, tgt_tokenizer.all_special_tokens, tgt_tokenizer.all_special_ids)

print(src_tokenizer.special_tokens_map, src_tokenizer.all_special_tokens, src_tokenizer.all_special_ids)

[out]:

({'bos_token': '[CLS]',
  'eos_token': '[SEP]',
  'unk_token': '[UNK]',
  'sep_token': '[SEP]',
  'pad_token': '[PAD]',
  'cls_token': '[CLS]',
  'mask_token': '[MASK]'},
 ['[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'],
 [101, 102, 100, 0, 103])

({'bos_token': '[CLS]',
  'eos_token': '[SEP]',
  'unk_token': '[UNK]',
  'sep_token': '[SEP]',
  'pad_token': '[PAD]',
  'cls_token': '[CLS]',
  'mask_token': '[MASK]'},
 ['[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'],
 [102, 103, 101, 0, 104])
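
For context, the ds_train and ds_valid used further below come from a parallel German–English dataset. This is only a rough sketch of the preprocessing I have in mind (raw_train/raw_valid and the "de"/"en" column names are placeholders for my own data): encoder inputs go through the source tokenizer, labels through the target tokenizer.

def preprocess(batch):
    # Encoder inputs use the source (German) vocabulary.
    model_inputs = src_tokenizer(batch["de"], max_length=128, truncation=True)
    # Labels use the target (English) vocabulary.
    labels = tgt_tokenizer(batch["en"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

ds_train = raw_train.map(preprocess, batched=True, remove_columns=raw_train.column_names)
ds_valid = raw_valid.map(preprocess, batched=True, remove_columns=raw_valid.column_names)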

Then comes the part where we need to set the special tokens on the model config. What should we set for the EOS token when the encoder's and decoder's tokenizers give it different ids?

e.g.

# set special tokens

# I guess this must be the decoder's
multibert.config.decoder_start_token_id = tgt_tokenizer.bos_token_id

# And these are the encoder's tokenizer or decoder's?
multibert.config.eos_token_id = ???_tokenizer.eos_token_id

# This is the same so it doesn't matter, I guess.
multibert.config.pad_token_id = ???_tokenizer.pad_token_id
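
If it helps, my current (unverified) guess is that both of these should come from the target tokenizer, since these ids are only used on the decoder side, i.e. to stop generation and to pad the labels:

# My guess, not verified: take the ids from the decoder's (target) tokenizer,
# since eos/pad in the top-level config only matter on the decoding side.
multibert.config.eos_token_id = tgt_tokenizer.eos_token_id  # [SEP] of bert-base-uncased
multibert.config.pad_token_id = tgt_tokenizer.pad_token_id  # same id (0) in both tokenizers here
# The warm-starting bert2bert examples also copy the decoder's vocab size to the top level:
multibert.config.vocab_size = multibert.config.decoder.vocab_size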

And when we create the collator, which tokenizer should we use?

data_collator = DataCollatorForSeq2Seq(???_tokenizer)
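
My current attempt (also unverified): as far as I can tell, DataCollatorForSeq2Seq only uses the tokenizer it is given to pad the encoder inputs, while the labels are padded separately with label_pad_token_id, so I pass the source tokenizer and hand it the model so decoder_input_ids can be built from the labels:

# Unverified attempt: pad input_ids with the source tokenizer; labels get -100 padding
# (ignored by the loss); passing the model lets the collator create
# decoder_input_ids from the labels.
data_collator = DataCollatorForSeq2Seq(
    tokenizer=src_tokenizer,
    model=multibert,
    label_pad_token_id=-100,
)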

Finally, when initializing the Seq2SeqTrainer, it asks for a tokenizer. Should it be the encoder's or the decoder's?

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=???_tokenizer,
    args=training_args,
    train_dataset=ds_train.with_format("torch"),
    eval_dataset=ds_valid.with_format("torch"),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
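
For completeness, this is roughly the compute_metrics referenced above. It decodes with the target tokenizer, since the generated ids and the labels are in the decoder's vocabulary (this is my assumption, and it relies on predict_with_generate=True in the training arguments):

import numpy as np

# Assumption: decode predictions and labels with the *target* tokenizer,
# because the decoder generates ids from the target vocabulary.
bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Labels were padded with -100 by the collator; restore pad ids before decoding.
    labels = np.where(labels != -100, labels, tgt_tokenizer.pad_token_id)
    decoded_preds = tgt_tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tgt_tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = bleu.compute(predictions=decoded_preds, references=[[ref] for ref in decoded_labels])
    return {"bleu": result["score"]}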

Hello, I encountered the same issue. Did you manage to resolve it?

Any news? I am encountering the same issue.