Slow DataLoader with big batch_size

Hi,

I’m investigating the datasets library for loading tabular data with PyTorch. I’m having trouble getting good performance with both map-style and iterable-style datasets:

from datasets import Dataset, load_from_disk
import time
from torch.utils.data import DataLoader

if __name__ == '__main__':
    DATASET_SIZE = 10_000_000
    BATCH_SIZE = 10_000
    MAX_NUM_ITER = 10

    # build a dummy 10M-row dataset once and persist it as Arrow
    ds = Dataset.from_dict({"idx": range(DATASET_SIZE)})
    ds.save_to_disk('test.hf')

    print('Loading dataset')
    ds = load_from_disk('test.hf')

    # map style dataset
    print('Running map style')
    data_loader = DataLoader(ds.with_format('torch'), batch_size=BATCH_SIZE)
    start = time.time()
    for i, batch in enumerate(data_loader):
        if i == MAX_NUM_ITER - 1:  # time exactly MAX_NUM_ITER batches
            break
    print((time.time() - start) / MAX_NUM_ITER)

    # iter style dataset
    print('Running iter style')
    iter_ds = ds.to_iterable_dataset()
    data_loader = DataLoader(iter_ds, batch_size=BATCH_SIZE)
    start = time.time()
    for i, batch in enumerate(data_loader):
        if i == MAX_NUM_ITER - 1:  # identical loop to the map-style one, for a fair comparison
            break
    print((time.time() - start) / MAX_NUM_ITER)

Results in

  • 0.146s per iteration for map style and
  • 0.686s per iteration for iterable style

If I understand the setup correctly (Differences between Dataset and IterableDataset), I should use an IterableDataset with torch so that I can shuffle without running into speed issues:

However, as soon as your Dataset has an indices mapping (via Dataset.shuffle(), for example), the speed can become 10x slower.

If you want to shuffle your dataset or use it with a PyTorch DataLoader, we recommend generating a sharded IterableDataset:
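If I read that right, that would be something like this (my own sketch; the num_shards and buffer_size values are arbitrary picks):

iter_ds = ds.to_iterable_dataset(num_shards=128)  # shard so shuffling can reorder shards and workers can split them
iter_ds = iter_ds.shuffle(seed=42, buffer_size=1_000)  # approximate shuffle: shard order plus a buffer of examples
data_loader = DataLoader(iter_ds, batch_size=BATCH_SIZE, num_workers=4)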

Is there any way to make data loading faster? I’m getting much better results just using a numpy memory-mapped file and slicing batches out directly, i.e. something like

arr[start_idx:start_idx + BATCH_SIZE]

but at least the IterableDataset should be doing effectively the same thing, since it’s just reading consecutive values from a memory-mapped file.
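For reference, the baseline I’m comparing against is essentially this (my sketch; the file name and int64 dtype are just placeholders):

import numpy as np

np.save('test.npy', np.arange(DATASET_SIZE, dtype=np.int64))  # write the same dummy data once
arr = np.load('test.npy', mmap_mode='r')  # memory-map the file instead of reading it into RAM

start = time.time()
for i in range(MAX_NUM_ITER):
    start_idx = i * BATCH_SIZE
    batch = arr[start_idx:start_idx + BATCH_SIZE].copy()  # .copy() forces the read; slicing alone only creates a view
print((time.time() - start) / MAX_NUM_ITER)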

Thanks, Lukas

Arrow is a more complex data format than raw NumPy, but we put a lot of effort into making it as fast as possible to read :slight_smile: In particular, IterableDataset objects provide excellent performance.

Maybe you can try iterating with ds.iter(batch_size=BATCH_SIZE) instead of letting the DataLoader pick examples one by one and merge them into batches (which copies the data)?
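Roughly like this, adapting your script (just a sketch; with_format('torch') makes each batch a dict of torch tensors):

start = time.time()
for i, batch in enumerate(ds.with_format('torch').iter(batch_size=BATCH_SIZE)):
    if i == MAX_NUM_ITER - 1:
        break
print((time.time() - start) / MAX_NUM_ITER)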

Also make sure you’re using the latest version of datasets. Many speed improvements have been added, especially in datasets 2.13.

Hey, thanks for the response. I’m using 2.14.5. ds.iter(batch_size=BATCH_SIZE) is much faster (a few ms per iteration), thanks!

  • What’s the best way to use this in combination with a DataLoader? Wrap it in an IterableDataset and return whole batches during iteration (roughly like the sketch below)?
  • Do I need to take care of calling shuffle as well as multi-process and multi-GPU data loading (num_workers > 0) myself in this case, or is this still handled when using ds.iter?
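To make the first bullet concrete, this is roughly what I have in mind (untested sketch; BatchedHFIterable is just a name I made up):

from torch.utils.data import IterableDataset as TorchIterableDataset

class BatchedHFIterable(TorchIterableDataset):
    def __init__(self, ds, batch_size):
        self.ds = ds
        self.batch_size = batch_size

    def __iter__(self):
        # yield whole batches, so the DataLoader has nothing left to collate
        yield from self.ds.iter(batch_size=self.batch_size)

# batch_size=None disables the DataLoader's own batching
data_loader = DataLoader(BatchedHFIterable(ds.with_format('torch'), BATCH_SIZE), batch_size=None)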

Hmm, I think currently you can’t use this batching method with a DataLoader, since ds.iter() returns a bare iterator (not a torch.utils.data.IterableDataset).

But it would be cool to implement IterableDataset.batch(), which would return an IterableDataset that yields batches. If that’s something that would be useful for you, feel free to open an issue at Issues · huggingface/datasets · GitHub and we can take a look at it
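In the meantime, one way to approximate it might be to (ab)use map(batched=True) so that each yielded example is itself a whole batch (untested sketch):

# wrap each column's batch in a length-1 list: one "example" == one whole batch
batched_ds = iter_ds.map(
    lambda batch: {k: [v] for k, v in batch.items()},
    batched=True,
    batch_size=BATCH_SIZE,
)
data_loader = DataLoader(batched_ds.with_format('torch'), batch_size=None)  # no re-batching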

Created a feature request here, thanks!