Fine-tuning for translation with facebook mbart-large-50

I am trying to fine-tune the facebook/mbart-large-50 model for an en-ro translation task.

from datasets import load_dataset
raw_datasets = load_dataset("wmt16", "ro-en")

Referring to the notebook, I have modified the code as follows.

from transformers import MBart50TokenizerFast
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")

max_input_length = 128
max_target_length = 128
source_lang = "en_XX"
target_lang = "ro_RO"
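
The preprocess_function I am using is based on the one in the notebook, roughly as below; I may have changed some arguments (e.g. adding return_tensors="pt") while experimenting, which could be relevant to the first error:

def preprocess_function(examples):
    # wmt16 "ro-en" stores each example as {"translation": {"en": ..., "ro": ...}};
    # the plain "en"/"ro" keys are independent of the mBART codes en_XX/ro_RO above
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["ro"] for ex in examples["translation"]]

    # note: if return_tensors="pt" is passed here, map(batched=True) receives
    # torch tensors instead of the lists it expects
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # tokenize the targets with the tokenizer switched to the target language
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs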

Please let me know the following.

  1. tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
    Is the above step necessary? I am unable to run it, as I get the following error.

TypeError: Provided function which is applied to all elements of table returns a dict of types [<class 'torch.Tensor'>, <class 'torch.Tensor'>]. When using batched=True, make sure provided function returns a dict of types like (<class 'list'>, <class 'numpy.ndarray'>).

  2. If the above step is bypassed, the following error is raised during training, i.e. trainer.train():

# We have to pad the labels before calling tokenizer.pad as this method won't pad them and needs them of the same length to return tensors.

AttributeError: 'tokenizers.Encoding' object has no attribute 'keys'

Please let me know the correct way of passing the values.
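
For reference, the rest of my setup follows the notebook roughly like this (the hyperparameter values below are placeholders rather than the exact ones I used):

from transformers import (
    MBartForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# the collator pads inputs and labels dynamically for each batch
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart-large-50-en-ro",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()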

Thank you.


Hi,
I am getting the same error: AttributeError: 'tokenizers.Encoding' object has no attribute 'keys'. Did you solve it? Please help me with this.