How to load a large hf dataset efficiently?

I am trying to load the dataset axiong/pmc_oa from the Hugging Face Hub. The dataset is around 22 GB and I have ~10 GB of RAM. Loading gets stuck at the file extraction step.

I also tried streaming mode but that’s giving another error.

from datasets import load_dataset
dataset = load_dataset("axiong/pmc_oa", 'pmc_oa', split='train', streaming=True)

Any suggestions on how to load large datasets like this more efficiently?

This dataset is made of one big 20GB+ zip file and a Python script to load the data, which is rather unoptimized for data loading.

Ideally, a dataset of this size should be split into multiple shards that can be loaded in parallel, and stored in a format suited for big image datasets (we usually recommend WebDataset or Parquet).

Anyway, I think it just takes some time to extract the big zip file.

Thank you for the reply. Following your suggestion, is there a way in HF datasets to split a dataset into multiple shards so they can be loaded in parallel?

Saving a dataset on HF using .push_to_hub() does upload multiple shards.
In particular, by default it splits the dataset into shards of 500MB and uploads each shard as a Parquet file on HF.

It’s also possible to manually get a shard of a dataset using the .shard() method.

@lhoestq what do you think about loading this dataset lazily, for example a batch of 64 samples at a time? Is that less efficient than loading those shards of data?

It’s generally a good idea indeed if you want to save disk space or to not have to wait to download the full dataset.

For example you can stream the dataset using the datasets library, by passing streaming=True to load_dataset().
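
To make the "batch of 64 samples at a time" idea concrete, here is a small sketch of a helper that groups any iterable into batches lazily. The helper name iter_batches and the fake stream are assumptions for the example. A dataset loaded with streaming=True is iterable, so it could be passed in place of the fake stream, assuming streaming works for that dataset.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(samples: Iterable[T], batch_size: int = 64) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the whole input."""
    iterator = iter(samples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Generator standing in for a streamed dataset (hypothetical data).
fake_stream = ({"id": i} for i in range(150))

batch_sizes = [len(batch) for batch in iter_batches(fake_stream, batch_size=64)]
print(batch_sizes)  # [64, 64, 22]
```

Only one batch is held in memory at a time, which is the point when the dataset is larger than RAM.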

However, even in streaming mode, it is better to have multiple shards in order to stream in parallel. Ideally, also use file formats that work well with streaming, like WebDataset or Parquet.

@aaditya did try streaming mode on this dataset, but the dataset’s custom loading script uses the jsonlines package, which we don’t support for streaming (we only extend the builtin open() for streaming).
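
Since only the builtin open() is extended for streaming, one possible workaround in a loading script is to parse JSON Lines with open() and the standard json module instead of jsonlines. The sketch below shows that pattern on a temporary file. The helper name read_jsonl and the field names are hypothetical, and whether this alone would make axiong/pmc_oa streamable has not been checked.

```python
import json
import os
import tempfile
from typing import Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line, using only the builtin open()."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Small temporary file standing in for the dataset's annotation file
# (hypothetical field names).
tmp = tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
)
tmp.write('{"image": "a.jpg", "caption": "first"}\n')
tmp.write('{"image": "b.jpg", "caption": "second"}\n')
tmp.close()

records = list(read_jsonl(tmp.name))
os.remove(tmp.name)
print(records)
```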