Slow DataLoader with big batch_size

lneukom · October 3, 2023, 10:15am

Hi,

I’m investigating datasets for loading tabular data with pytorch. I’m having problems getting good performance when using both map and iterable style datasets:

from datasets import Dataset, load_from_disk
import time
from torch.utils.data import DataLoader

if __name__ == '__main__':
    DATASET_SIZE = 10_000_000
    BATCH_SIZE = 10_000
    MAX_NUM_ITER = 10

    ds = Dataset.from_dict({"idx": range(DATASET_SIZE)})
    ds.save_to_disk('test.hf')

    print('Loading dataset')
    ds = load_from_disk('test.hf')

    # map style dataset
    print('Running map style')
    data_loader = DataLoader(ds.with_format('torch'), batch_size=BATCH_SIZE)
    start = time.time()
    for i, batch in enumerate(iter(data_loader)):
        if i > MAX_NUM_ITER:
            break
    print((time.time() - start) / MAX_NUM_ITER)

    # iter style dataset
    print('Running iter style')
    iter_ds = ds.to_iterable_dataset()
    data_loader = DataLoader(iter_ds, batch_size=BATCH_SIZE)
    start = time.time()
    for i, batch in enumerate(iter(data_loader)):
        print(batch)
        if i > MAX_NUM_ITER:
            break
    print((time.time() - start) / MAX_NUM_ITER)

Results in

0.146s per iteration for map style and
0.686s per iteration for iterable style

If I understand the setup correctly (Differences between Dataset and IterableDataset) I should use the IterableDataset in combination with torch so I can do shuffling without getting speed issues:

However as soon as your Dataset has an indices mapping (via Dataset.shuffle()) for example), the speed can become 10x slower.

If you want to shuffle your dataset or use it with a PyTorch DataLoader, we recommend generating a sharded IterableDataset:

Is there any way to make dataloading faster? I’m getting much better results just using a numpy memory mapped file and directly loading the batches, i.e. something like

arr[start_idx:start_idx+BATCH_SIZE]

but atleast the IterableDataset should be doing the same since its just loading consecutive values from a memory mapped file.

Thanks, Lukas

lhoestq · October 5, 2023, 9:18am

Arrow is a more complex data format that raw Numpy, but we put a lot of effort into making it as fast as possible to read In particular IterableDataset objects provide excellent performance.

Maybe you can try to use an IterableDataset with ds.iter(batch_size=BATCH_SIZE) ? instead of letting the DataLoader pick examples one by one and merge them into batches (thus, copying the data).

Also make sure to use the latest version of datasets. Many speed improvements have been added especially in datasets 2.13

lneukom · October 5, 2023, 10:03am

Hey, thanks for the response. I’m using 2.14.5. ds.iter(batch_size=BATCH_SIZE) is much faster (a few ms per iteration), thanks!

What’s the best way to use this in combination with a DataLoader? Wrap it in an IterableDataset and return whole batches during iteration?
do I need to take care of calling shuffle as well as multi process and multi gpu data loading (num_workers > 0) myself in this case or is this still handled when using ds.iter?

lhoestq · October 5, 2023, 10:53am

Hmm I think currently you can’t use this batching method with a DataLoader, since it returns a bare iterator (not a torch.utils.data.IterableDataset).

But it would be cool to implement IterableDataset.batch() that would return an IterableDataset that yields batches. If if it’s a thing that would be useful for you feel free to open an issue at Issues · huggingface/datasets · GitHub and we can take a look at it

lneukom · October 5, 2023, 11:13am

Created a feature request here, thanks!

Topic		Replies	Views
Iterating on dataset extremely slow 🤗Datasets	8	1761	November 6, 2024
Slow Iteration speed (with and without keep_in_memory=True) 🤗Datasets	3	1383	March 14, 2023
Roadmap/timeline for dataset streaming 🤗Datasets	9	2267	July 5, 2021
Num_worker with IterableDataset 🤗Datasets	4	2562	November 16, 2023
Getting correct length via DataLoader and speed 🤗Datasets	4	442	April 5, 2024

Slow DataLoader with big batch_size

Related topics