How to load a large hf dataset efficiently?

Saving a dataset on HF using .push_to_hub() does upload multiple shards.
In particular, it splits the dataset into shards of about 500MB (the default max_shard_size) and uploads each shard as a Parquet file on the Hub.
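For example, a minimal sketch (the dataset and repo name "username/my-dataset" here are placeholders, not from the original post):

```python
from datasets import load_dataset

ds = load_dataset("imdb", split="train")

# push_to_hub uploads the dataset as Parquet shards.
# max_shard_size defaults to "500MB" but can be overridden, e.g.:
ds.push_to_hub("username/my-dataset", max_shard_size="200MB")
```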

It’s also possible to manually get a shard of a dataset using the .shard() method.
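For instance, a small sketch of .shard() (again using a placeholder repo name), which lets you work on one slice of the data at a time:

```python
from datasets import load_dataset

ds = load_dataset("username/my-dataset", split="train")

# Split the dataset into 4 equal-sized shards and keep only the first one
first_shard = ds.shard(num_shards=4, index=0)
print(first_shard.num_rows)
```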