Before I started using the datasets library, I usually did the padding per batch.
I found that dataset.map supports batched and batch_size. But it seems that only padding all examples (in dataset.map) to a fixed length or to max_length makes sense with the batch_size used later when creating the DataLoader.
Otherwise, if I use a map function like lambda x: tokenizer(x["sentence"], padding=True, truncation=True), I get errors like RuntimeError: stack expects each tensor to be equal size, but got [56] at entry 0 and [53] at entry 8 when iterating over the DataLoader, since I could not find a way to make the DataLoader iterate over the same batches that datasets.map used.
Padding all the examples to the same length makes training slower than padding to the maximum length per batch.
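For context, here is a minimal sketch of the setup I'm describing (GLUE SST-2 and bert-base-uncased are just placeholders for my actual data and model):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

dataset = load_dataset("glue", "sst2", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# padding=True pads to the longest example *within each map batch*,
# so different map batches end up with different sequence lengths.
dataset = dataset.map(
    lambda x: tokenizer(x["sentence"], padding=True, truncation=True),
    batched=True,
    batch_size=1000,
)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# The DataLoader builds its own batches, which mix examples padded to
# different lengths -> "stack expects each tensor to be equal size".
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```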
You can use a data_collator in your PyTorch DataLoader that pads each batch to its maximum length.
Using padding=True in the data_collator pads to the maximum length of the batch, so that's the way to go.
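For example, here is a rough sketch of a collator that tokenizes and pads on the fly (the dataset, tokenizer, and the "sentence"/"label" column names are placeholders taken from the question; adapt them to your data):

```python
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

raw_dataset = load_dataset("glue", "sst2", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def collate_fn(examples):
    # Tokenize and pad here, so padding=True pads to the longest
    # sequence in *this* batch only.
    batch = tokenizer(
        [ex["sentence"] for ex in examples],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    batch["labels"] = torch.tensor([ex["label"] for ex in examples])
    return batch

loader = DataLoader(raw_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
```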
But if you want to do the tokenization in map instead of in the data collator, you can; you just need an extra padding step in the data_collator to make sure all the examples in each batch end up with the same length.
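Something along these lines, with DataCollatorWithPadding adding the per-batch padding (again, the dataset and column names are assumptions based on the question):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_dataset = load_dataset("glue", "sst2", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize in map, but do NOT pad here.
tokenized = raw_dataset.map(
    lambda x: tokenizer(x["sentence"], truncation=True),
    batched=True,
)
# Drop the raw text and index columns so every remaining feature
# can be collated into tensors ("sentence"/"idx" are SST-2 column names).
tokenized = tokenized.remove_columns(["sentence", "idx"])

# padding=True pads each batch to its longest sequence; the collator
# also renames "label" to "labels".
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
loader = DataLoader(tokenized, batch_size=32, shuffle=True, collate_fn=data_collator)
```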