Hi there,
I am trying to fine-tune an MBart model for text generation on a multilingual dataset. Following the steps from the documentation (MBart and MBart-50) to tokenize sequences in different languages, I am seeing unexpected behaviour:
```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO"
)

example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"

inputs = tokenizer(example_english_phrase, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(expected_translation_romanian, return_tensors="pt")
```
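For reference, the ids of the two language codes can be looked up directly via the tokenizer's `lang_code_to_id` mapping (a quick check on my side, so the codes are easy to spot in the tensors below):

```python
# Ids of the two language codes in the mbart-large vocabulary
print(tokenizer.lang_code_to_id["en_XX"])  # 250004
print(tokenizer.lang_code_to_id["ro_RO"])  # 250020
```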
From my understanding, the source and target input ids should have different structures, as described in the docs:

> The source text format is `X [eos, src_lang_code]`, where `X` is the source text. The target text format is `[tgt_lang_code] X [eos]`. `bos` is never used.
When I tried to reproduce this snippet, this is what I got:
```python
print(inputs["input_ids"])
# tensor([[  8274, 127873,  25916,      7,   8622,   2071,    438,  67485,     53,
#          187895,     23,  51712,      2, 250004]])
print(labels["input_ids"])
# tensor([[ 47711,   7844, 127666,      8,  18347,  18147,   1362,    315,  42071,
#              36,  31563,   8454,  33796,    451,    346, 125577,      2, 250020]])
```
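To see where the codes actually land, I also mapped the ids back to tokens (my own sanity check, on the same checkpoint):

```python
# Map ids back to tokens; both sequences end with </s> followed by a language code
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
# [..., '</s>', 'en_XX']
print(tokenizer.convert_ids_to_tokens(labels["input_ids"][0].tolist()))
# [..., '</s>', 'ro_RO']
```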
In both cases the language code token is at the end of the sequence. Shouldn't it be at the end for the input but at the start for the target?
Thanks in advance!