We (the TorchGeo team) just finished creating unlabeled satellite imagery datasets for SSL pre-training. There are 5 satellites, each with ~400 GB of tarballs, for a total dataset size of 2 TB. We’re wondering if it’s possible to store this dataset on HF, but have a few questions:
- Is there a maximum repository size? We noticed that there is a 50 GB per-file limit, although we could split each tarball into multiple files to stay under it.
- Should each satellite have its own dataset repository, or should all 2 TB be in a single repository? This is similar to How to organize hundreds of pre-trained models but for datasets instead of models.
- Is this best practice? Even if it’s theoretically possible, we want to be good citizens and avoid using more storage than we’re supposed to.
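For context on the second question, here is a sketch of how we imagine the per-file splitting would work (filenames are placeholders, not our actual tarball names, and this is our assumption rather than official HF guidance) — split with standard coreutils, and consumers reassemble with `cat`. Demonstrated on a small dummy file with 1 MB chunks; for the real tarballs we would use something like `-b 40G`:

```shell
# Create a small dummy "tarball" to demonstrate (placeholder for a ~400 GB file).
dd if=/dev/urandom of=demo.tar bs=1M count=5 2>/dev/null

# Split into 1 MB chunks: demo.tar.part01, demo.tar.part02, ...
# For the real dataset: split -b 40G to stay under the 50 GB per-file limit.
split -b 1M --numeric-suffixes=1 demo.tar demo.tar.part

# Consumers reassemble by concatenating the parts in order,
# then verify the result is byte-identical to the original.
cat demo.tar.part* > demo_reassembled.tar
cmp demo.tar demo_reassembled.tar && echo "reassembly OK"
```

Since the chunk names sort lexicographically, `cat demo.tar.part* > demo.tar` is all a downstream user would need after downloading.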