Hi,
After going through the Hugging Face course, I'm trying to fine-tune mBART-50 for translation on a custom dataset that I created using the dataset template.
I am using this tokenizer:
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX", tgt_lang="ta_IN")
This is my tokenize function:
def tokenize_function(example):
    list_of_en_str = [item["en"] for item in example["translation"]]
    list_of_ta_str = [item["ta"] for item in example["translation"]]
    inp = tokenizer(list_of_en_str, truncation=True)
    # tokenize the targets with the tokenizer's target-language settings
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(list_of_ta_str, truncation=True)
    inp["labels"] = labels["input_ids"]
    return inp
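As an aside, I believe newer versions of transformers also let you pass the targets directly via text_target instead of using the as_target_tokenizer context manager; if I understand correctly, the same function could be written as:

def tokenize_function(example):
    list_of_en_str = [item["en"] for item in example["translation"]]
    list_of_ta_str = [item["ta"] for item in example["translation"]]
    # text_target tokenizes the Tamil side with the target-language settings
    # and stores the result under "labels" automatically
    return tokenizer(list_of_en_str, text_target=list_of_ta_str, truncation=True)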
I am applying tokenization to the dataset as given below:
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)
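I am aware that map keeps the original translation column (the raw strings) next to the new input_ids, attention_mask, and labels columns, so I filter it out manually before collating below; I assume I could equally drop it at map time with remove_columns:

tokenized_dataset = raw_dataset.map(
    tokenize_function, batched=True, remove_columns=["translation"]
)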
This is my data collator:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
When I try to grab a few samples to check whether the data collation is working:
samples = tokenized_dataset["train"][:4]
# keep only the tokenized columns; the raw "translation" strings cannot be tensorized
samples = {k: v for k, v in samples.items() if k not in ["translation"]}
batch = data_collator(samples)
Here, I am getting this error:
ValueError: expected sequence of length 31 at dim 1 (got 38)
During handling of the above exception, another exception occurred:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
Is DataCollatorWithPadding the right data collator to use for mBART-50 translation?
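From the docs, I suspect I might need DataCollatorForSeq2Seq instead, since DataCollatorWithPadding seems to pad only input_ids and attention_mask and leaves the variable-length labels untouched. A minimal sketch of what I have in mind (model here stands for my MBartForConditionalGeneration instance, which I have not shown above):

from transformers import DataCollatorForSeq2Seq

# also pads the labels (with -100, so the padded positions are ignored by the loss)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)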
Is there anything else I have missed?
Please help me fix this.
Thanks in advance.