Padding in datasets

Before I started using the datasets library, I would usually apply padding per batch.

I found that map supports batched and batch_size. But it seems that only padding all examples to a fixed length (or to max_length) makes sense if I then pass a batch_size when creating the DataLoader.
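To make the fixed-length strategy concrete, here is a toy sketch in plain Python (no tokenizer involved) of what padding="max_length" with truncation=True does to each example; the function name and the value 8 are made up for illustration:

```python
# Toy sketch of the "pad everything to one fixed length" strategy,
# i.e. what padding="max_length", truncation=True does per example.
def pad_to_fixed(seq, max_length=8, pad_id=0):
    """Truncate or pad a token-id list to exactly max_length."""
    seq = seq[:max_length]                       # truncation=True
    return seq + [pad_id] * (max_length - len(seq))

rows = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10, 11, 12]]
padded = [pad_to_fixed(r) for r in rows]
# every row now has length 8, so any batch_size in a DataLoader stacks cleanly
```

Because every example has the same length, the DataLoader's default collation can stack them into tensors with any batch_size.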

Otherwise, if I use a map function like lambda x: tokenizer(x["sentence"], padding=True, truncation=True), I get errors like RuntimeError: stack expects each tensor to be equal size, but got [56] at entry 0 and [53] at entry 8 when iterating over the DataLoader, since I could not find a way to make the DataLoader iterate over the same batches that were used during tokenization.

Am I right?

That’s because padding=True makes the tokenizer pad each batch to its longest sequence. Therefore two batches may have different lengths.
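A minimal sketch of why this breaks, in plain Python (no tokenizer, hypothetical pad_batch helper): padding to the longest sequence within each batch gives batches of different widths, which is exactly what makes a later torch.stack across differently-sized tensors fail.

```python
# Toy sketch: padding=True pads to the longest sequence *within* a batch,
# so two batches can come out with different widths.
def pad_batch(batch, pad_id=0):
    """Pad a list of token-id lists to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

batch_a = pad_batch([[1, 2, 3], [4, 5]])       # every row has width 3
batch_b = pad_batch([[6], [7, 8, 9, 10, 11]])  # every row has width 5
# Rows are consistent inside each batch, but batch_a and batch_b disagree,
# which is what produces the "stack expects each tensor to be equal size"
# RuntimeError when the DataLoader re-batches across those boundaries.
```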

Then having all the examples in the dataset padded to the same length could slow down training, right?


Padding all the examples to the same length makes training slower than padding to the maximum length per batch.
You can use a data_collator in your PyTorch DataLoader that pads each batch to its maximum length.

@lhoestq do you have an example of that?

@maximin what was your solution in place of lambda entry: self.tokenizer(entry["sentence"], padding=True)?