Before I got into the datasets library, I usually padded per batch. I found that dataset.map supports batched and batch_size. But it seems that only padding all examples (in dataset.map) to a fixed length or max_length makes sense with a subsequent batch_size when creating the DataLoader. Otherwise, if I…

Padding in datasets

RylanSchaeffer October 11, 2021, 8:12pm 6

@maximin, what solution did you use in place of lambda entry: self.tokenizer(entry[ padding=True,)?
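
For readers with the same question: one common alternative to padding inside dataset.map is to tokenize without padding and then pad each batch only to its own longest sequence when the batch is assembled. The sketch below shows that idea in plain Python. It is not @maximin's actual solution, which this thread does not show. The function name pad_batch, the pad_id value of 0, and the token ids are all hypothetical and chosen for illustration.

```python
from typing import Dict, List


def pad_batch(
    examples: List[Dict[str, List[int]]], pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Pad every example to the longest sequence in this batch only.

    pad_id=0 is an assumption; a real tokenizer defines its own pad token id.
    """
    max_len = max(len(ex["input_ids"]) for ex in examples)
    input_ids, attention_mask = [], []
    for ex in examples:
        ids = ex["input_ids"]
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical, already-tokenized examples with no padding applied in map.
examples = [
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 7592, 2088, 999, 102]},
    {"input_ids": [101, 102]},
    {"input_ids": [101, 2023, 2003, 102]},
]

batch_size = 2
batches = [
    pad_batch(examples[i : i + batch_size])
    for i in range(0, len(examples), batch_size)
]

for batch in batches:
    print([len(row) for row in batch["input_ids"]])
# prints [5, 5] then [4, 4]: each batch is padded to its own maximum
```

With PyTorch, a function like this could be passed to DataLoader through its collate_fn argument, after converting the lists to tensors. The transformers library also ships DataCollatorWithPadding, which does this kind of per-batch padding using the tokenizer's own pad token.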
