Padding in datasets

I usually applied padding per batch before I started using the datasets library.

I found that dataset.map supports batched and batch_size arguments. But it seems that only padding all examples (in dataset.map) to a fixed length or max_length makes sense when a batch_size is later used to create a DataLoader.
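
For reference, here is a minimal sketch of the fixed-length approach I mean (the GLUE SST-2 dataset and bert-base-uncased tokenizer are just placeholders for my actual setup):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "sst2", split="train")  # placeholder dataset with a "sentence" column

def tokenize(batch):
    # Pad every example to the same fixed length, so any later
    # DataLoader batch size yields tensors of equal size.
    return tokenizer(
        batch["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

dataset = dataset.map(tokenize, batched=True, batch_size=1000)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    print(batch["input_ids"].shape)  # always [32, 128]
    break
```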

If, instead, I use a map function like lambda x: tokenizer(x["sentence"], padding=True, truncation=True), I get errors like RuntimeError: stack expects each tensor to be equal size, but got [56] at entry 0 and [53] at entry 8 when iterating the DataLoader, since I could not find a way to make the DataLoader iterate over the same batches that datasets.map used.

Am I right?

That’s because padding=True makes the tokenizer pad to the longest sequence in the batch, so two batches may have different lengths.
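
One common workaround (not part of the question above) is to skip padding in map entirely and pad at collate time with DataCollatorWithPadding, so the padding is applied to exactly the batches the DataLoader builds. A minimal sketch, again assuming a GLUE SST-2-style dataset with a "sentence" column and a bert-base-uncased tokenizer:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "sst2", split="train")  # placeholder dataset

# No padding here: each example keeps its natural length.
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True),
    batched=True,
)
# Drop non-tensor columns so the collator only sees tokenizer outputs and labels.
dataset = dataset.remove_columns(["sentence", "idx"])

# DataCollatorWithPadding pads each batch to the longest sequence in that batch.
loader = DataLoader(
    dataset,
    batch_size=32,
    collate_fn=DataCollatorWithPadding(tokenizer, return_tensors="pt"),
)
for batch in loader:
    print(batch["input_ids"].shape)  # [32, longest length in this batch]
    break
```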