Getting correct length via DataLoader and speed

Hi ! batch_size in map() only controls how many examples are passed at a time to the function inside the map operation - it doesn't batch the dataset's output when you iterate over it. If you want your data loader to yield batches, pass batch_size to the DataLoader itself:

from torch.utils.data import DataLoader

# batch_size in map() only sets how many examples the mapped function
# receives at a time; batch_size on the DataLoader sets the size of the
# batches yielded when iterating.
train_dl = DataLoader(
    train_dataset.map(
        collate_fn, batched=True, batch_size=10, remove_columns=["url", "short_caption", "caption"]
    ),
    batch_size=10,
)
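
With batch_size set on the DataLoader, len(train_dl) reports the number of batches rather than the number of examples, which is usually the length you want here. A quick sanity check, reusing train_dl from the snippet above (the exact contents of each batch depend on what your collate_fn returns):

import math

# A DataLoader's length is the number of batches it will yield:
# ceil(num_examples / batch_size) when drop_last=False (the default).
assert len(train_dl) == math.ceil(len(train_dl.dataset) / 10)

# Each iteration yields one batch; with batch_size=10 every batch has
# 10 examples except possibly the last one.
for batch in train_dl:
    # batch is a dict of columns, each holding up to 10 entries
    print({k: len(v) for k, v in batch.items()})
    break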