I have been following the Hugging Face course, which recommends padding each batch dynamically with a data collator rather than padding the whole dataset up front. I set this up to fine-tune a T5 model on my own dataset, but I get an error when I try to inspect or iterate over one batch. The error says I need to "activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length", even though I already enabled truncation when tokenizing the dataset.
Does anyone know how I can fix this? Thanks for your help.
This is my function to tokenize the data,
and this is where I create the DataCollatorWithPadding and initialise the DataLoader.
Hi there! You should also read the section focused on Asking for help on the forums to learn how to format your code and error messages. Including screenshots is not really helpful, as we can't copy the code inside and try to execute it.
As for your question, you are using the wrong data collator here. Your problem seems to be sequence-to-sequence, so you should use DataCollatorForSeq2Seq, which will also pad the labels. This is probably the reason you get the error: DataCollatorWithPadding does not touch the labels, since it's designed for sequence classification problems, so label sequences of different lengths cannot be stacked into one tensor. See more in the translation or summarization sections of the course.
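To make "pad the labels" concrete, here is a minimal pure-Python sketch of what a sequence-to-sequence collator does with a batch. It is not the library's implementation: `pad_batch` and the toy token ids are hypothetical, the pad id `0` is assumed to match T5's pad token, and `-100` is used for the labels because that is the default index PyTorch's cross-entropy loss ignores.

```python
# Simplified illustration of per-batch padding for a seq2seq task.
# NOT the transformers implementation; pad_batch is a hypothetical helper.

LABEL_PAD_TOKEN_ID = -100  # positions with this id are ignored by the loss


def pad_batch(features, pad_token_id=0, label_pad_token_id=LABEL_PAD_TOKEN_ID):
    """Pad input_ids, attention_mask and labels to the longest item in the batch."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_pad = max_input_len - len(f["input_ids"])
        label_pad = max_label_len - len(f["labels"])
        # Inputs are padded with the tokenizer's pad token and masked out.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * input_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * input_pad)
        # Labels are padded separately; a classification-style collator
        # skips this step, which leaves ragged label lists in the batch.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * label_pad)
    return batch


# Two toy examples whose inputs AND labels have different lengths.
features = [
    {"input_ids": [13, 7, 1], "labels": [42, 8, 9, 1]},
    {"input_ids": [5, 6, 7, 8, 1], "labels": [3, 1]},
]
batch = pad_batch(features)
for row in batch["labels"]:
    print(row)
```

With the library itself, the equivalent step would be something like `DataCollatorForSeq2Seq(tokenizer, model=model)` passed as `collate_fn` to your `DataLoader`, where `tokenizer` and `model` are whatever you already loaded for T5.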