How to load a large hf dataset efficiently?

aaditya · January 16, 2024, 3:52am

I am trying to load the dataset axiong/pmc_oa from the Hugging Face Hub. The dataset is around 22 GB and I have about 10 GB of RAM; loading gets stuck at the file extraction step.

I also tried streaming mode, but that gives a different error:

from datasets import load_dataset
dataset = load_dataset("axiong/pmc_oa", 'pmc_oa', split='train', streaming=True)
print(next(iter(dataset)))

Any suggestions on how to load large datasets more efficiently?

lhoestq · January 16, 2024, 4:55pm

This dataset is made of one big 20GB+ zip file and a Python script to load the data, which is rather unoptimized for data loading.

Ideally a dataset of this size should be split into multiple shards so it can be loaded in parallel, and stored in a format suited for big image datasets (we usually recommend WebDataset or Parquet).

Anyway, I think it just takes some time to extract the big zip file.

aaditya · January 16, 2024, 7:22pm

Thank you for the reply. Following your suggestion, is there a way in HF datasets to split a dataset into multiple shards so it can be loaded in parallel?

lhoestq · January 18, 2024, 12:13pm

Saving a dataset on HF using .push_to_hub() does upload multiple shards.
In particular it splits the dataset in shards of 500MB and uploads each shard as a Parquet file on HF.

It’s also possible to manually get a shard of a dataset using the .shard() method

samcaetano · January 20, 2024, 11:14pm

@lhoestq what do you think about loading this dataset lazily, for example in batches of 64 samples at a time? Is that less efficient than loading those shards of data?

lhoestq · January 22, 2024, 6:02pm

It’s generally a good idea indeed if you want to save disk space or to not have to wait to download the full dataset.

For example you can stream the dataset using the datasets library, by passing streaming=True to load_dataset().

However, even in streaming mode it is better to have multiple shards so that you can stream in parallel. And ideally use file formats that work well with streaming, like WebDataset or Parquet.

@aaditya did try streaming mode on this dataset, but its custom loading script uses the jsonlines package, which we don’t support for streaming (we only extend the builtin open() for streaming).
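To make the "64 samples at a time" idea concrete, here is a minimal sketch of lazy batching over any iterable of examples. The helper iter_batches is a hypothetical name, and the generator below is only a stand-in so the snippet runs offline; with a real dataset you would pass the object returned by load_dataset(..., streaming=True) instead.

```python
from itertools import islice


def iter_batches(examples, batch_size=64):
    """Yield lists of up to `batch_size` examples without loading everything.

    Hypothetical helper; works on any iterable, including a streamed dataset.
    """
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a streamed dataset: 150 fake examples produced lazily
fake_stream = ({"id": i} for i in range(150))
batch_sizes = [len(batch) for batch in iter_batches(fake_stream, batch_size=64)]
```

Only one batch is held in memory at a time, so memory use depends on the batch size rather than on the dataset size.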
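Since only the builtin open() is extended for streaming, one possible workaround inside a loading script is to read the JSON Lines file with open() and the standard json module rather than the jsonlines package. The sketch below assumes one JSON object per line; the helper name read_jsonl, the sample file and its field names are made up for illustration and do not come from the pmc_oa script.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one dict per non-empty line, using only the builtin open()."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Small sample file so the sketch runs on its own (hypothetical fields)
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(tmp_path, "w", encoding="utf-8") as f:
    f.write('{"image": "a.jpg", "caption": "first"}\n')
    f.write('{"image": "b.jpg", "caption": "second"}\n')

records = list(read_jsonl(tmp_path))
```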
