How to handle very large datasets

We (the TorchGeo team) have just finished creating unlabeled satellite imagery datasets for self-supervised learning (SSL) pre-training. There are 5 satellites, each with ~400 GB of tarballs, for a total dataset size of 2 TB. We’re wondering whether it’s possible to store this dataset on the HF Hub, but have a few questions:

  1. Is there a maximum repository size? We noticed that there is a 50 GB per-file limit, although we could split each tarball into multiple smaller files to get around this.
  2. Should each satellite have its own dataset repository, or should all 2 TB be in a single repository? This is similar to How to organize hundreds of pre-trained models but for datasets instead of models.
  3. Is this best practice? Even if it’s theoretically possible, we want to be good citizens and avoid using more storage than we’re supposed to.

Hi! Some datasets on the Hub are larger than this, so it shouldn’t be a problem :slightly_smiling_face:.

Answers to your questions:

  1. Yes, splitting the files into chunks smaller than 50 GB is the preferred solution.
  2. A single repository works well: you can define one config per satellite, so users can fetch the images of a specific satellite on its own.
  3. I think this boils down to choosing the right compression type.
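For answer 2, one config per satellite can be declared in the dataset card’s YAML front matter; the satellite names and file layout below are only an illustration of the shape, not your actual repo structure:

```yaml
configs:
- config_name: sentinel2
  data_files: sentinel2/*.tar.gz
- config_name: landsat9
  data_files: landsat9/*.tar.gz
```

Users could then load a single satellite’s imagery with something like `load_dataset("your-org/your-dataset", "sentinel2")` instead of downloading all 2 TB.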
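For the chunking in answer 1, a shell `split -b 50G file.tar` does the job; as a minimal, dependency-free sketch of the same idea (the function names and `.partNNNN` suffix are illustrative, not a Hub convention), splitting and reassembling could look like:

```python
def split_file(path, chunk_size):
    """Split a file into numbered parts, each at most chunk_size bytes."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            part_path = f"{path}.part{index:04d}"
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            parts.append(part_path)
            index += 1
    return parts


def join_files(parts, out_path):
    """Reassemble the parts (in order) into a single file."""
    with open(out_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                dst.write(src.read())
```

For a 2 TB upload you would of course use a much larger `chunk_size` (just under 50 GB) and upload the parts with `huggingface_hub`, but the reassembly logic on the consumer side is the same concatenation shown here.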