I have issues combining a DataLoader and DataCollator. The following code with DataCollatorWithPadding results in a ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
when I want to iterate through the batches.
from torch.utils.data.dataloader import DataLoader
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16,
collate_fn=data_collator)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, collate_fn=data_collator)
for epoch in range(2):
model.train()
for step, batch in enumerate(train_dataloader):
outputs = model(**batch)
loss = outputs.loss
However, I found annother approach where I changed the DataCollator to lambda x: x
Then it gives me a TypeError: DistilBertForSequenceClassification object argument after ** must be a mapping, not list
from torch.utils.data.dataloader import DataLoader
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16, collate_fn=lambda x: x )
eval_dataloader = DataLoader(eval_dataset, batch_size=16, collate_fn=lambda x: x)
for epoch in range(2):
model.train()
for step, batch in enumerate(train_dataloader):
outputs = model(**batch)
loss = outputs.loss
For reproducability and for the rest of the code I provide you a Jupyter Notebook on Google Colab. You find the errors at the bottom of the notebook. Using the trainer class in my particular scenario is no option.