How to handle very large datasets

We (TorchGeo) just finished creating unlabeled satellite imagery datasets for self-supervised learning (SSL) pre-training. There are 5 satellites, each with a ~400 GB tarball, for a total dataset size of 2 TB. We're wondering whether it's possible to store this dataset on HF, but have a few questions:

  1. Is there a maximum repository size? We noticed that there is a 50 GB/file limit, although we could split each tarball into multiple files to get around this (see the sketch after this list).
  2. Should each satellite have its own dataset repository, or should all 2 TB be in a single repository? This is similar to How to organize hundreds of pre-trained models but for datasets instead of models.
  3. Is this best practice? Even if it’s theoretically possible, we want to be good citizens and avoid using more storage than we’re supposed to.
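
Concretely, one way we could do the split is to re-pack each tarball into smaller, self-contained tar shards, roughly as in the sketch below; the file name and the 45 GB target size are placeholders, and a single member larger than the target would still end up in its own oversized shard:

```python
import tarfile

SOURCE = "sentinel2.tar"   # placeholder name for one ~400 GB tarball
SHARD_SIZE = 45 * 1024**3  # stay comfortably under the 50 GB/file limit


def shard_tarball(path: str, shard_size: int = SHARD_SIZE) -> list[str]:
    """Re-pack one large tarball into smaller, self-contained tar shards."""
    shards: list[str] = []
    shard, written, index = None, 0, 0
    with tarfile.open(path, "r") as src:
        for member in src:
            # Start a new shard when the current one would exceed the limit.
            if shard is None or written + member.size > shard_size:
                if shard is not None:
                    shard.close()
                shard_path = f"{path}.{index:04d}.tar"
                shard = tarfile.open(shard_path, "w")
                shards.append(shard_path)
                written, index = 0, index + 1
            shard.addfile(member, src.extractfile(member))
            written += member.size
    if shard is not None:
        shard.close()
    return shards


if __name__ == "__main__":
    for shard in shard_tarball(SOURCE):
        print(shard)
```

The advantage over raw byte-splitting is that each shard stays a valid tar archive, so it can be downloaded and read independently without reassembling the original file.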

Hi! Some datasets on the Hub are larger than 2 TB, so it shouldn't be a problem 🙂

Answers to your questions:

  1. Yes, splitting the files into chunks smaller than 50 GB is the preferred solution (a sketch for uploading the resulting parts follows this list).
  2. You can keep all satellites in a single dataset repository and have one config per satellite, so users can fetch the images of a specific satellite only (a sketch of such a loading script also follows this list).
  3. As for best practice, I think this boils down to choosing the right compression type.
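
Here is a minimal sketch of a loading script with one config per satellite, assuming the repo stores per-satellite tar shards; the satellite names, shard paths, and image-only features are placeholders to adapt to the actual layout:

```python
import datasets

# Placeholder satellite names and shard paths; adjust to the actual repo layout.
_SATELLITES = ["sentinel1", "sentinel2", "landsat8", "landsat9", "naip"]
_SHARDS = {s: [f"{s}/{s}.{i:04d}.tar" for i in range(8)] for s in _SATELLITES}


class SatelliteImagery(datasets.GeneratorBasedBuilder):
    """Unlabeled satellite imagery with one configuration per satellite."""

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name=s, description=f"Imagery from {s}")
        for s in _SATELLITES
    ]

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"image": datasets.Image()}),
        )

    def _split_generators(self, dl_manager):
        # Only the shards of the requested satellite are downloaded.
        paths = dl_manager.download(_SHARDS[self.config.name])
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"archives": [dl_manager.iter_archive(p) for p in paths]},
            )
        ]

    def _generate_examples(self, archives):
        key = 0
        for archive in archives:
            for path, file in archive:
                yield key, {"image": {"path": path, "bytes": file.read()}}
                key += 1
```

Users could then download or stream a single satellite, e.g. `load_dataset("<repo_id>", "sentinel2", streaming=True)`.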
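And once the shards exist locally, pushing everything into a single dataset repository can be scripted with `huggingface_hub`; the repo id and folder layout below are placeholders:

```python
from huggingface_hub import HfApi

api = HfApi()

# Placeholder repo id; one dataset repository holds all satellites.
repo_id = "torchgeo/ssl-imagery"
api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

# Upload one local folder of shards per satellite, e.g. shards/sentinel2/*.tar.
for satellite in ["sentinel1", "sentinel2", "landsat8", "landsat9", "naip"]:
    api.upload_folder(
        repo_id=repo_id,
        repo_type="dataset",
        folder_path=f"shards/{satellite}",
        path_in_repo=satellite,
    )
```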