Getting correct length via DataLoader and speed

Hi ! batch_size in map() only controls how many examples are passed at a time to the function inside the map operation - it doesn't batch the dataset's output when you iterate over it. If you want your data loader to yield batches, pass batch_size to the DataLoader itself:

from torch.utils.data import DataLoader

# batch_size in map() only sets how many examples the mapped function
# receives at a time; batch_size on the DataLoader sets the size of the
# batches yielded when iterating.
train_dl = DataLoader(
    train_dataset.map(
        collate_fn, batched=True, batch_size=10, remove_columns=["url", "short_caption", "caption"]
    ),
    batch_size=10,
)
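
With batch_size set on the DataLoader, len(train_dl) reports the number of batches rather than the number of examples, which is usually the length you want here. A quick sanity check, reusing train_dl from the snippet above (the exact contents of each batch depend on what your collate_fn returns):

import math

# A DataLoader's length is the number of batches it will yield:
# ceil(num_examples / batch_size) when drop_last=False (the default).
assert len(train_dl) == math.ceil(len(train_dl.dataset) / 10)

# Each iteration yields one batch; with batch_size=10 every batch has
# 10 examples except possibly the last one.
for batch in train_dl:
    # batch is a dict of columns, each holding up to 10 entries
    print({k: len(v) for k, v in batch.items()})
    break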