Actually, the training script is quite large, so I will only share the most important parts:
Loading the model:
```python
from transformers import EncoderDecoderModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token

# Warm-start both the encoder and the decoder from the same pretrained checkpoint.
encoder_decoder_model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/Multilingual-MiniLM-L12-H384",
    "microsoft/Multilingual-MiniLM-L12-H384",
)

# generate() needs to know where decoding starts and stops, and how to pad.
encoder_decoder_model.config.decoder_start_token_id = tokenizer.bos_token_id
encoder_decoder_model.config.eos_token_id = tokenizer.eos_token_id
encoder_decoder_model.config.pad_token_id = tokenizer.pad_token_id
```
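None of these ids should end up as `None`; a quick sanity check along these lines (not part of the actual script, just for illustration) confirms that:

```python
# All three ids must be set, otherwise generate() cannot start or stop properly.
assert encoder_decoder_model.config.decoder_start_token_id is not None
assert encoder_decoder_model.config.eos_token_id is not None
assert encoder_decoder_model.config.pad_token_id is not None
```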
Tokenization of a single (source, target) example:
```python
model_inputs = self.tokenizer(
    examples["source"].strip(),
    max_length=self.params["encoder_max_length"],
    padding=False,  # no padding here; padding is left for batching time
    truncation=True,
)
targets = self.tokenizer(
    examples["target"].strip(),
    max_length=self.params["decoder_max_length"],
    padding=False,
    truncation=True,
)
# The target token ids become the labels for the seq2seq loss.
model_inputs["labels"] = targets["input_ids"]
```
Then I create a Seq2SeqTrainer and train the model.
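That part is mostly boilerplate; a minimal sketch of what it looks like, assuming dynamic padding via `DataCollatorForSeq2Seq` (the argument values below are placeholders, not my real settings):

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Pads input_ids and labels per batch; labels are padded with -100 so that
# the padded positions are ignored by the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=encoder_decoder_model)

training_args = Seq2SeqTrainingArguments(
    output_dir="output",            # placeholder
    per_device_train_batch_size=8,  # placeholder
    num_train_epochs=3,             # placeholder
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=encoder_decoder_model,
    args=training_args,
    train_dataset=train_dataset,  # the tokenized dataset from above
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```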
For inference, this is the generation config:
```python
generation_config = dict(
    max_length=None,
    min_length=None,
    do_sample=False,  # together with num_beams=1 this is plain greedy decoding
    early_stopping=True,
    num_beams=1,
    temperature=1.0,
    top_k=None,
    top_p=None,
    length_penalty=1.0,  # > 1.0 favors longer sequences, < 1.0 shorter ones
    num_return_sequences=1,
    max_time=None,  # in seconds
    num_beam_groups=1,
    output_scores=False,
)
```
And the call to generate():
```python
inputs = tokenizer(text, return_tensors="pt")
# generate() returns token ids, which still have to be decoded into text.
generated_ids = model.generate(inputs.input_ids, **generation_config)
generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```
I hope this is enough to reproduce the issue. Thank you, @ydshieh.