How to load a large hf dataset efficiently?

Streaming is indeed a good idea if you want to save disk space or avoid waiting for the full dataset to download.

For example, you can stream the dataset with the datasets library by passing streaming=True to load_dataset().

However, even in streaming mode it's better to have multiple shards so that you can stream in parallel, and ideally to use file formats that stream well, such as WebDataset or Parquet.

@aaditya did try streaming mode on this dataset, but its custom loading script uses the jsonlines package, which we don't support for streaming (we only extend the built-in open() for streaming).