Fine-tuning for translation with facebook mbart-large-50

I am trying to fine-tune the facebook/mbart-large-50 model for an en-ro translation task.

from datasets import load_dataset
raw_datasets = load_dataset("wmt16", "ro-en")

Referring to the notebook, I have modified the code as follows.

from transformers import MBart50TokenizerFast
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")

max_input_length = 128
max_target_length = 128
source_lang = "en_XX"
target_lang = "ro_RO"
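
The preprocess_function I am using is based on the one in the notebook, roughly as below; I may have changed some arguments (e.g. adding return_tensors="pt") while experimenting, which could be relevant to the first error:

def preprocess_function(examples):
    # wmt16 "ro-en" stores each example as {"translation": {"en": ..., "ro": ...}};
    # the plain "en"/"ro" keys are independent of the mBART codes en_XX/ro_RO above
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["ro"] for ex in examples["translation"]]

    # note: if return_tensors="pt" is passed here, map(batched=True) receives
    # torch tensors instead of the lists it expects
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # tokenize the targets with the tokenizer switched to the target language
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs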

Please let me know the following.

  1. tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
    Is the above step necessary? I am unable to run it, as I get the following error.

TypeError: Provided function which is applied to all elements of table returns a dict of types [<class 'torch.Tensor'>, <class 'torch.Tensor'>]. When using batched=True, make sure provided function returns a dict of types like (<class 'list'>, <class 'numpy.ndarray'>).

  2. If the above step is bypassed, the following error is raised during training, i.e. trainer.train():

# We have to pad the labels before calling tokenizer.pad as this method won't pad them and needs them of the same length to return tensors.

AttributeError: 'tokenizers.Encoding' object has no attribute 'keys'

Please let me know the correct way of passing the values.
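
For reference, the rest of my setup follows the notebook roughly like this (the hyperparameter values below are placeholders rather than the exact ones I used):

from transformers import (
    MBartForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# the collator pads inputs and labels dynamically for each batch
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart-large-50-en-ro",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()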

Thank you.


Hi,
I am getting the same error: AttributeError: 'tokenizers.Encoding' object has no attribute 'keys'. Did you solve it? Please help me with this.