DataCollator for training mbart50 for translation with custom dataset

After going through the Hugging Face course, I’m trying to fine-tune mBart50 for translation on a custom dataset that I created using the Dataset template.

I am using this tokenizer:
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX", tgt_lang="ta_IN")

This is my tokenize function:
def tokenize_function(example):

    list_of_en_str = [item['en'] for item in example['translation']]
    list_of_ta_str = [item['ta'] for item in example['translation']]

    inp = tokenizer(list_of_en_str, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(list_of_ta_str, truncation=True)

    inp['labels'] = labels['input_ids']

    return inp

I am applying tokenization to the dataset as given below:
tokenized_dataset =, batched=True)

This is my DataCollator:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

When I try to grab a few samples to check whether the data collator is working:
samples = tokenized_dataset["train"][:4]
samples = {
    k: v for k, v in samples.items()
}

batch = data_collator(samples)

Here, I am getting this error:
ValueError: expected sequence of length 31 at dim 1 (got 38)
During handling of the above exception, another exception occurred:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
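
A minimal sketch of what the error seems to point at, assuming the ragged column is labels: as far as I can tell, DataCollatorWithPadding pads the tokenizer's own inputs (input_ids, attention_mask) but leaves labels as they are, so label lists of length 31 and 38 cannot be stacked into one tensor. The token ids, the pad_column helper and the pad id of 1 below are made up for illustration; -100 is the value PyTorch's cross-entropy loss ignores by default.

```python
# Hypothetical token ids standing in for two tokenized examples.
# The label lengths 31 and 38 mirror the numbers in the error message.
features = [
    {"input_ids": list(range(10)), "labels": list(range(31))},
    {"input_ids": list(range(12)), "labels": list(range(38))},
]


def pad_column(rows, pad_value):
    """Right-pad every list in rows to the length of the longest one."""
    max_len = max(len(row) for row in rows)
    return [row + [pad_value] * (max_len - len(row)) for row in rows]


# Assumed pad id of 1 for input_ids (illustrative only).
padded_inputs = pad_column([f["input_ids"] for f in features], pad_value=1)
# Labels are padded with -100 so padded positions do not contribute to the loss.
padded_labels = pad_column([f["labels"] for f in features], pad_value=-100)

# Both columns are now rectangular and could be turned into tensors.
print([len(row) for row in padded_inputs])
print([len(row) for row in padded_labels])
```

If that reading is right, the DataCollatorForSeq2Seq class in transformers, which pads labels with -100 by default, may be the collator to try here instead of DataCollatorWithPadding.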

Is DataCollatorWithPadding the right DataCollator to use for mBart50 translation?
Is there anything else I have missed out?
Please help me fix this.

Thanks in advance.