Hi there,
I am trying to fine-tune an MBart model for text generation on a multilingual dataset. Following the steps from the documentation (MBart and MBart-50) to tokenize sequences in different languages, I am seeing unexpected behaviour:
```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO"
)

example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"

inputs = tokenizer(example_english_phrase, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(expected_translation_romanian, return_tensors="pt")
```
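For reference, the ids of the two language codes can be looked up directly via the tokenizer's `lang_code_to_id` mapping (a quick check on my side, so the codes are easy to spot in the tensors below):

```python
# Ids of the two language codes in the mbart-large vocabulary
print(tokenizer.lang_code_to_id["en_XX"])  # 250004
print(tokenizer.lang_code_to_id["ro_RO"])  # 250020
```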
From my understanding, the source and target input ids should have different structures, as described in the docs:

> The source text format is `X [eos, src_lang_code]`, where `X` is the source text. The target text format is `[tgt_lang_code] X [eos]`. `bos` is never used.
When I tried to reproduce this snippet, this is what I got:
```python
print(inputs["input_ids"])
# tensor([[  8274, 127873,  25916,      7,   8622,   2071,    438,  67485,     53,
#          187895,     23,  51712,      2, 250004]])
print(labels["input_ids"])
# tensor([[ 47711,   7844, 127666,      8,  18347,  18147,   1362,    315,  42071,
#              36,  31563,   8454,  33796,    451,    346, 125577,      2, 250020]])
```
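To see where the codes actually land, I also mapped the ids back to tokens (my own sanity check, on the same checkpoint):

```python
# Map ids back to tokens; both sequences end with </s> followed by a language code
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
# [..., '</s>', 'en_XX']
print(tokenizer.convert_ids_to_tokens(labels["input_ids"][0].tolist()))
# [..., '</s>', 'ro_RO']
```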
In both cases the language code token is at the end of the sequence. Shouldn't it be at the end for the input but at the start for the target?
Thanks in advance!