Padding in datasets

I usually applied padding per batch before I started using the datasets library.

I found that dataset.map supports batched and batch_size. But it seems that only padding all examples (in dataset.map) to a fixed length or max_length makes sense when a DataLoader with its own batch_size is created afterwards.

Otherwise, if I use a map function like lambda x: tokenizer(x["sentence"], padding=True, truncation=True), I get errors like RuntimeError: stack expects each tensor to be equal size, but got [56] at entry 0 and [53] at entry 8 when iterating over the DataLoader, since I could not find a way to make the DataLoader iterate over the same batches that datasets.map used.

Am I right?

That’s because padding=True makes the tokenizer pad to the longest sequence in its batch, so two batches may end up with different lengths.
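
If you want equal-size tensors straight out of map, you can pad everything to a fixed length instead. A minimal sketch, assuming a "sentence" column and an arbitrary max_length of 128:

```python
# Sketch: pad every example to the same fixed length in map.
# The "sentence" column name and max_length=128 are assumptions.
dataset = dataset.map(
    lambda x: tokenizer(
        x["sentence"], padding="max_length", truncation=True, max_length=128
    ),
    batched=True,
)
```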

Then having all the examples in the dataset padded to the same length could slow down training, right?

Padding all the examples to the same length makes training slower than padding to the maximum length per batch.
You can use a data_collator in your PyTorch DataLoader that pads each batch to the length of its longest example.

@lhoestq do you have an example of that?

@maximin what was your solution in place of lambda entry: self.tokenizer(entry["sentence"], padding=True)?

padding=True in the data_collator does the padding to the maximum length of the batch, so that’s the way to go :slight_smile:
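
For example, a collate_fn that tokenizes and pads inside the DataLoader could look like this. A minimal sketch: the glue/sst2 dataset and its "sentence"/"label" columns are only placeholders for whatever data you actually use.

```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "sst2", split="train")

def collate_fn(examples):
    # Tokenize and pad to the longest sequence in *this* batch only.
    batch = tokenizer(
        [ex["sentence"] for ex in examples],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    batch["labels"] = torch.tensor([ex["label"] for ex in examples])
    return batch

dataloader = DataLoader(dataset, batch_size=16, collate_fn=collate_fn)
```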

You can also do the tokenization in map instead of in the data collator, but then you must add an extra padding step in the data_collator to make sure all the examples in each batch have the same length.
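
A sketch of that variant, again with glue/sst2 standing in for your own data: tokenize without padding in map, then let DataCollatorWithPadding pad each batch to the length of its longest example.

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "sst2", split="train")

# Tokenize without padding; each example keeps its own length.
dataset = dataset.map(
    lambda x: tokenizer(x["sentence"], truncation=True),
    batched=True,
)
dataset = dataset.remove_columns(["sentence", "idx"])
dataset = dataset.rename_column("label", "labels")
dataset.set_format("torch")

# The collator pads each batch to its longest example at load time.
data_collator = DataCollatorWithPadding(tokenizer)
dataloader = DataLoader(dataset, batch_size=16, collate_fn=data_collator)
```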