Recommended max size of dataset?

I’m about to create a large dataset directly: roughly 1B samples, each about [16 x 8000] in size plus some small metadata. Do you foresee any issues during generation, or with loading and using it after it’s finished generating? Any ideas are welcome, thank you.


It’s probably going to be over 500TB…
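
Back-of-envelope, assuming float32 values (the dtype wasn’t stated, so that part is a guess):

```python
# Rough size estimate; float32 is an assumption (halve for float16, double for float64).
num_samples = 1_000_000_000
bytes_per_sample = 16 * 8000 * 4            # ~512 KB per sample
total_tb = num_samples * bytes_per_sample / 1e12
print(f"~{total_tb:.0f} TB before metadata or compression")   # ~512 TB
```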

If you’re going to upload more than 300GB of data to Hugging Face in a single repository, it’s recommended to contact HF in advance by email: website@huggingface.co

Also, if you’re training on a large dataset with Hugging Face’s libraries or torch, sharding the dataset tends to make things run more stably. @lhoestq
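
Roughly along these lines with the datasets library (the path and shard count here are just placeholders):

```python
from datasets import load_from_disk

# Placeholder path and shard count; split one big dataset into shards
# so each piece can be saved, copied, or streamed independently.
ds = load_from_disk("my_big_dataset")
num_shards = 1024
for i in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)
    shard.save_to_disk(f"my_big_dataset_shards/shard_{i:05d}")
```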

Hi, thanks for the quick reply! It would be just for training, so upload is not a problem. And I have individual files that I will use Dataset.from_generator to create an HF dataset out of, so I think the issue you mentioned shouldn’t be a problem either.
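
For reference, this is roughly what I have in mind (the file pattern and column names are placeholders):

```python
import glob

import numpy as np
from datasets import Dataset

# Placeholder file pattern; each .npy file is assumed to hold one (16, 8000) array.
def gen(file_paths):
    for path in file_paths:
        arr = np.load(path)
        yield {"data": arr, "source": path}

files = sorted(glob.glob("raw_samples/*.npy"))
ds = Dataset.from_generator(gen, gen_kwargs={"file_paths": files})
```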

I guess I’m more concerned about whether save_to_disk would work for something this big, and whether Dataset.load_from_disk would be problematic in terms of the number of open files?


With a dataset that huge, that could indeed become an issue…

It may be too much for the functions that rely on the default torch data loading internally, so something like WebDataset might be more stable. There are other backends and formats suited to huge datasets too, but I can’t remember them off the top of my head… :sweat_smile:
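
For instance, a rough, untested sketch with the webdataset package (shard pattern, keys, and sizes are all made up for illustration):

```python
import io
import json
import os

import numpy as np
import webdataset as wds

# Write samples into .tar shards; the loop is just a stand-in for the real data source.
os.makedirs("shards", exist_ok=True)
with wds.ShardWriter("shards/samples-%06d.tar", maxcount=100_000) as sink:
    for i in range(1_000):
        arr = np.random.rand(16, 8000).astype(np.float32)
        buf = io.BytesIO()
        np.save(buf, arr)
        sink.write({
            "__key__": f"sample{i:09d}",
            "npy": buf.getvalue(),                           # raw bytes pass through as-is
            "json": json.dumps({"id": i}).encode("utf-8"),
        })

# Stream the shards back for training, shard by shard, without loading everything at once.
dataset = (
    wds.WebDataset("shards/samples-000000.tar")   # in practice e.g. "samples-{000000..000999}.tar"
    .decode()                                     # default decoders handle .npy / .json entries
    .to_tuple("npy", "json")
)
```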

save_to_disk / load_from_disk can handle big datasets; you can even use multiprocessing with num_proc= to speed up save_to_disk

though performance can depend on your environment, so I’d still advise trying it on smaller datasets first and seeing how it scales
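
A minimal example of that, with toy data and arbitrary shard/process counts:

```python
from datasets import Dataset, load_from_disk

# Toy stand-in for the real dataset; shard and process counts are arbitrary here.
ds = Dataset.from_dict({"x": list(range(10_000))})

ds.save_to_disk("toy_dataset", num_shards=64, num_proc=8)

# load_from_disk memory-maps the Arrow files instead of reading them into RAM.
reloaded = load_from_disk("toy_dataset")
print(reloaded)
```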

