When initializing an EncoderDecoderModel from two different pre-trained checkpoints, the following runs without error:
import torch
import evaluate
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
src_tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
tgt_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
multibert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "dbmdz/bert-base-german-uncased", "bert-base-uncased",
)
When we inspect the tokenizers, however, they are not aligned, especially the special token ids, even after setting the special tokens consistently:
src_tokenizer.bos_token = src_tokenizer.cls_token
src_tokenizer.eos_token = src_tokenizer.sep_token
src_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tgt_tokenizer.bos_token = tgt_tokenizer.cls_token
tgt_tokenizer.eos_token = tgt_tokenizer.sep_token
tgt_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
print(tgt_tokenizer.special_tokens_map, tgt_tokenizer.all_special_tokens, tgt_tokenizer.all_special_ids)
print(src_tokenizer.special_tokens_map, src_tokenizer.all_special_tokens, src_tokenizer.all_special_ids)
[out] (tgt_tokenizer first, then src_tokenizer):
({'bos_token': '[CLS]',
'eos_token': '[SEP]',
'unk_token': '[UNK]',
'sep_token': '[SEP]',
'pad_token': '[PAD]',
'cls_token': '[CLS]',
'mask_token': '[MASK]'},
['[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'],
[101, 102, 100, 0, 103])
({'bos_token': '[CLS]',
'eos_token': '[SEP]',
'unk_token': '[UNK]',
'sep_token': '[SEP]',
'pad_token': '[PAD]',
'cls_token': '[CLS]',
'mask_token': '[MASK]'},
['[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'],
[102, 103, 101, 0, 104])
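To make the mismatch explicit, pairing the tokens with the ids copied from the output above shows that only [PAD] agrees between the two vocabularies:

```python
# Ids copied verbatim from the printed output above: only [PAD] maps to
# the same id in both tokenizers.
tgt_ids = dict(zip(["[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"],
                   [101, 102, 100, 0, 103]))   # bert-base-uncased
src_ids = dict(zip(["[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"],
                   [102, 103, 101, 0, 104]))   # dbmdz/bert-base-german-uncased
mismatched = [tok for tok in tgt_ids if tgt_ids[tok] != src_ids[tok]]
print(mismatched)  # ['[CLS]', '[SEP]', '[UNK]', '[MASK]']
```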
Then comes the part where we set the model config's special token ids. What should eos_token_id be when the encoder's and decoder's tokenizers assign the EOS token different ids?
e.g.
# set special tokens
# I guess this must be the decoder's
multibert.config.decoder_start_token_id = tgt_tokenizer.bos_token_id
# And these are the encoder's tokenizer or decoder's?
multibert.config.eos_token_id = ???_tokenizer.eos_token_id
# The pad id is 0 in both tokenizers, so I guess it doesn't matter here.
multibert.config.pad_token_id = ???_tokenizer.pad_token_id
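My working assumption (not verified) is that these config ids apply to the decoder side: eos stops generation and pad is what gets masked out of the labels, and both live in the target vocabulary. A toy illustration of that pad-masking step, with made-up ids (0 standing in for tgt_tokenizer.pad_token_id):

```python
# Toy illustration (made-up ids): label positions equal to the decoder-side
# pad id are typically replaced with -100 so the cross-entropy loss ignores
# them, which is why the pad id would need to come from the target tokenizer.
pad_id = 0
labels = [5, 6, 7, pad_id, pad_id]
masked = [-100 if t == pad_id else t for t in labels]
print(masked)  # [5, 6, 7, -100, -100]
```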
And when we create the data collator, which tokenizer should we use?
data_collator = DataCollatorForSeq2Seq(???_tokenizer)
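To make the question concrete, here is a stripped-down stand-in for what I understand the collator to do (an assumption on my part, not the library's actual implementation): input_ids are padded with the passed tokenizer's pad id, while labels are padded with -100, which is why the choice of tokenizer seems to matter for the encoder inputs.

```python
# Hypothetical stand-in for DataCollatorForSeq2Seq's padding behaviour
# (an assumption, not the real code): the passed tokenizer's pad id fills
# input_ids, while labels are filled with label_pad_token_id = -100.
def toy_collate(batch, pad_id, label_pad_id=-100):
    max_in = max(len(ex["input_ids"]) for ex in batch)
    max_lab = max(len(ex["labels"]) for ex in batch)
    return {
        "input_ids": [ex["input_ids"] + [pad_id] * (max_in - len(ex["input_ids"]))
                      for ex in batch],
        "labels": [ex["labels"] + [label_pad_id] * (max_lab - len(ex["labels"]))
                   for ex in batch],
    }

# Made-up token ids for illustration only:
batch = [{"input_ids": [102, 9, 103], "labels": [101, 7, 102]},
         {"input_ids": [102, 9, 8, 103], "labels": [101, 102]}]
out = toy_collate(batch, pad_id=0)
print(out["input_ids"])  # [[102, 9, 103, 0], [102, 9, 8, 103]]
print(out["labels"])     # [[101, 7, 102], [101, 102, -100]]
```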
Finally, when instantiating the Seq2SeqTrainer, it asks for a tokenizer: should it be the encoder's or the decoder's?
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=???_tokenizer,
    args=training_args,
    train_dataset=ds_train.with_format("torch"),
    eval_dataset=ds_valid.with_format("torch"),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
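Whichever tokenizer ends up on the trainer, the generated predictions are ids in the decoder's vocabulary, so as far as I can tell any decoding inside compute_metrics has to go through tgt_tokenizer either way. A toy stand-in with a made-up vocabulary:

```python
# Toy stand-in for tgt_tokenizer.batch_decode (the vocabulary below is
# made up): predictions come back as decoder-vocabulary ids, so decoding
# them with the encoder's tokenizer would produce garbage.
id_to_token = {101: "[CLS]", 102: "[SEP]", 0: "[PAD]", 7592: "hello"}

def toy_batch_decode(batch, skip_special=(101, 102, 0)):
    return [" ".join(id_to_token[i] for i in seq if i not in skip_special)
            for seq in batch]

print(toy_batch_decode([[101, 7592, 102, 0]]))  # ['hello']
```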