How to deal with DataCollator and DataLoaders in Huggingface?

3r1c · February 2, 2023, 3:03pm

I'm having trouble combining a DataLoader with a DataCollator. When I iterate over the batches, the following code using DataCollatorWithPadding raises this error: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

from torch.utils.data.dataloader import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16, collate_fn=data_collator)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, collate_fn=data_collator)

for epoch in range(2):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
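One common cause of this ValueError, though not confirmed for the notebook here, is a column left in the dataset that cannot be converted to a tensor, for example the raw text strings. The sketch below is a plain-Python diagnostic for one example row. The helper name non_tensor_fields and the sample row are hypothetical and are not part of transformers.

```python
# Hypothetical diagnostic, not part of transformers: list the fields of one
# dataset example whose values could not be packed into a numeric tensor.
def non_tensor_fields(example):
    def is_numeric(value):
        # Numbers (and bools) are fine; lists are fine if all items are fine.
        if isinstance(value, (bool, int, float)):
            return True
        if isinstance(value, (list, tuple)):
            return all(is_numeric(item) for item in value)
        return False

    return [name for name, value in example.items() if not is_numeric(value)]


# Made-up example row: "text" stands in for a leftover raw string column.
example = {
    "text": "a raw sentence",
    "input_ids": [101, 2023, 102],
    "attention_mask": [1, 1, 1],
    "label": 0,
}

print(non_tensor_fields(example))  # ['text']
```

If train_dataset is a datasets.Dataset, running this on train_dataset[0] shows which columns to drop, for instance with train_dataset.remove_columns([...]), before building the DataLoader.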

I also tried another approach, replacing the data collator with lambda x: x. That raises a different error: TypeError: DistilBertForSequenceClassification object argument after ** must be a mapping, not list.

from torch.utils.data.dataloader import DataLoader

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16, collate_fn=lambda x: x)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, collate_fn=lambda x: x)

for epoch in range(2):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
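The TypeError follows from what an identity collate function returns: the batch stays a list of per-example dicts, and ** unpacking needs a single mapping. The sketch below reproduces this in plain Python. The sample data, pad_collate, and fake_model are hypothetical stand-ins for illustration, not the transformers implementation, and they use Python lists instead of tensors.

```python
# Two made-up tokenized examples of different lengths.
samples = [
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1], "label": 1},
    {"input_ids": [101, 102], "attention_mask": [1, 1], "label": 0},
]


def fake_model(**kwargs):
    """Hypothetical stand-in for model(**batch): returns the argument names."""
    return sorted(kwargs)


def pad_collate(batch, pad_id=0):
    """Hypothetical padding collate: list of dicts -> one dict of padded lists."""
    max_len = max(len(item["input_ids"]) for item in batch)

    def pad(seq, fill):
        return seq + [fill] * (max_len - len(seq))

    return {
        "input_ids": [pad(item["input_ids"], pad_id) for item in batch],
        "attention_mask": [pad(item["attention_mask"], 0) for item in batch],
        "labels": [item["label"] for item in batch],
    }


# With collate_fn=lambda x: x the batch is still a list, so ** fails.
identity_batch = (lambda x: x)(samples)
try:
    fake_model(**identity_batch)
except TypeError as err:
    print("TypeError:", err)

# A collate function that returns a mapping can be unpacked with **.
padded_batch = pad_collate(samples)
print(fake_model(**padded_batch))
print(padded_batch["input_ids"])
```

A real collate function would also convert the padded lists to tensors. The point here is only the shape of the batch: one dict keyed by argument name instead of a list of rows.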

For reproducibility, the rest of the code is in a Jupyter notebook on Google Colab. The errors appear at the bottom of the notebook. Using the Trainer class is not an option in my particular scenario.

Colab Notebook
