Recommended max size of dataset?

I’m about to create a large dataset directly: roughly 1B samples, each about [16 x 8000] in size plus some small metadata. Do you foresee any issues during generation, or with loading and using it after it’s finished generating? Any ideas are welcome, thank you.


It’s probably going to be over 500TB…
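
Back-of-envelope, assuming float32 values (the dtype wasn’t stated, so that part is a guess):

```python
# Rough size estimate; float32 is an assumption (halve for float16, double for float64).
num_samples = 1_000_000_000
bytes_per_sample = 16 * 8000 * 4            # ~512 KB per sample
total_tb = num_samples * bytes_per_sample / 1e12
print(f"~{total_tb:.0f} TB before metadata or compression")   # ~512 TB
```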

If you’re going to upload more than 300GB of data to Hugging Face in a single repository, it’s recommended to contact HF in advance by email: website@huggingface.co

Also, if you’re training on a large dataset with Hugging Face’s libraries or torch, sharding the dataset tends to make things run more stably. @lhoestq
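
Roughly along these lines with the datasets library (the path and shard count here are just placeholders):

```python
from datasets import load_from_disk

# Placeholder path and shard count; split one big dataset into shards
# so each piece can be saved, copied, or streamed independently.
ds = load_from_disk("my_big_dataset")
num_shards = 1024
for i in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)
    shard.save_to_disk(f"my_big_dataset_shards/shard_{i:05d}")
```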

Hi, thanks for the quick reply! It would be just for training, so upload is not a problem. And I have individual files that I will use Dataset.from_generator to create an HF dataset out of, so I think the issue you mentioned shouldn’t be a problem either.
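
For reference, this is roughly what I have in mind (the file pattern and column names are placeholders):

```python
import glob

import numpy as np
from datasets import Dataset

# Placeholder file pattern; each .npy file is assumed to hold one (16, 8000) array.
def gen(file_paths):
    for path in file_paths:
        arr = np.load(path)
        yield {"data": arr, "source": path}

files = sorted(glob.glob("raw_samples/*.npy"))
ds = Dataset.from_generator(gen, gen_kwargs={"file_paths": files})
```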

I guess I’m more concerned about whether save_to_disk would work for something this big, and whether Dataset.load_from_disk would be problematic in terms of the number of open files?


With a dataset that huge, that could indeed become an issue…

It may be too much for the functions that rely on the default torch data loading internally, so something like WebDataset might be more stable. There are other backends and formats suited to huge datasets too, but I can’t remember them off the top of my head… :sweat_smile:
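
For instance, a rough, untested sketch with the webdataset package (shard pattern, keys, and sizes are all made up for illustration):

```python
import io
import json
import os

import numpy as np
import webdataset as wds

# Write samples into .tar shards; the loop is just a stand-in for the real data source.
os.makedirs("shards", exist_ok=True)
with wds.ShardWriter("shards/samples-%06d.tar", maxcount=100_000) as sink:
    for i in range(1_000):
        arr = np.random.rand(16, 8000).astype(np.float32)
        buf = io.BytesIO()
        np.save(buf, arr)
        sink.write({
            "__key__": f"sample{i:09d}",
            "npy": buf.getvalue(),                           # raw bytes pass through as-is
            "json": json.dumps({"id": i}).encode("utf-8"),
        })

# Stream the shards back for training, shard by shard, without loading everything at once.
dataset = (
    wds.WebDataset("shards/samples-000000.tar")   # in practice e.g. "samples-{000000..000999}.tar"
    .decode()                                     # default decoders handle .npy / .json entries
    .to_tuple("npy", "json")
)
```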

save_to_disk / load_from_disk can handle big datasets; you can even use multiprocessing with num_proc= to speed up save_to_disk

though performance can depend on your environment, so I’d still advise trying it on smaller datasets first and seeing how it scales
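
A minimal example of that, with toy data and arbitrary shard/process counts:

```python
from datasets import Dataset, load_from_disk

# Toy stand-in for the real dataset; shard and process counts are arbitrary here.
ds = Dataset.from_dict({"x": list(range(10_000))})

ds.save_to_disk("toy_dataset", num_shards=64, num_proc=8)

# load_from_disk memory-maps the Arrow files instead of reading them into RAM.
reloaded = load_from_disk("toy_dataset")
print(reloaded)
```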

