DataCollator for training mbart50 for translation with custom dataset

Preetha · June 24, 2021, 4:44pm

Hi,
After going through the Hugging Face course, I’m trying to fine-tune mBART-50 for translation on a custom dataset that I created using the dataset template.

I am using this tokenizer:
from transformers import MBart50TokenizerFast
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX", tgt_lang="ta_IN")

This is my tokenize function:
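def tokenize_function(example):
    # Batched input: example["translation"] is a list of {"en": ..., "ta": ...} dicts.
    list_of_en_str = [item["en"] for item in example["translation"]]
    list_of_ta_str = [item["ta"] for item in example["translation"]]

    # Tokenize the English source sentences.
    inp = tokenizer(list_of_en_str, truncation=True)

    # Tokenize the Tamil targets in target-language mode.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(list_of_ta_str, truncation=True)

    inp["labels"] = labels["input_ids"]

    return inp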

I am applying tokenization to the dataset as given below:
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)

This is my DataCollator:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

When I grab a few samples to check whether the data collator works:
samples = tokenized_dataset["train"][:4]
samples = {
    k: v for k, v in samples.items()
}

batch = data_collator(samples)

Here, I am getting this error:
ValueError: expected sequence of length 31 at dim 1 (got 38)
During handling of the above exception, another exception occurred:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Is DataCollatorWithPadding the right DataCollator to use for mBart50 translation?
Is there anything else I have missed?
Please help me fix this.

Thanks in advance.
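
For readers hitting the same error, a likely cause is that DataCollatorWithPadding pads input_ids and attention_mask through the tokenizer but leaves the labels lists at their original, differing lengths, so they cannot be stacked into one tensor. In transformers, DataCollatorForSeq2Seq is meant for this case: it also pads labels, by default with -100 so that the padded positions are ignored by the loss. The sketch below is not library code. It is a plain-Python illustration of that label padding, and the helper name pad_labels and the token IDs are made up for the example.

```python
# Illustration only: pad_labels is a hypothetical helper, not a transformers API.
# It mimics the label padding that a seq2seq collator performs.

LABEL_PAD_ID = -100  # index ignored by PyTorch's cross-entropy loss by default


def pad_labels(label_lists, pad_id=LABEL_PAD_ID):
    """Right-pad every label sequence to the length of the longest one."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_id] * (max_len - len(labels)) for labels in label_lists]


# Two label sequences of different lengths (made-up token IDs),
# analogous to the lengths 31 and 38 reported in the error above.
batch_labels = [
    [5, 6, 7, 2],
    [5, 8, 9, 10, 11, 2],
]

padded = pad_labels(batch_labels)

# Every row now has the same length, so the batch can become one tensor.
for row in padded:
    print(len(row), row)
```

With the real library, the corresponding change would presumably be to build the collator as DataCollatorForSeq2Seq(tokenizer, model=model), where model is the loaded mBART-50 model. As far as I can tell, that collator expects a list of per-example dicts rather than a dict of lists, and any string columns such as translation need to be dropped before collating.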
