Max individual file size for LFS files is 46.6GB

Hi,

There seems to be a new limit for datasets, and I was just wondering if this is expected behavior. I've been successfully pushing yearly zipped-Zarr stores of US precipitation radar data to openclimatefix/mrms · Datasets at Hugging Face, each around 100-130GB. I just tried to push an updated and fixed store for 2018 and am now getting a new error saying the max size is 46.6GB. I can split the Zarr stores into smaller ones, but it is simpler and easier to have a single large Zarr store that is read once.

(dgmr) [jacob@ocf mrms]$ git push
batch response: jects:   0% (0/1), 0 B | 0 B/s                                                                                                                                                                                               
You need to configure your repository to enable upload of files > 5GB.
Run "huggingface-cli lfs-enable-largefiles ./path/to/your/repo" and try again.

error: failed to push some refs to 'https://huggingface.co/datasets/openclimatefix/mrms'
(dgmr) [jacob@ocf mrms]$ huggingface-cli lfs-enable-largefiles .
Local repo set up for largefiles
(dgmr) [jacob@ocf mrms]$ git push
[0f7bef0d818fe9c05f7c821bf4b66f9218a4bae1a1ad2ae6274288f687704f28] Max individual file size for LFS files: 46.6GB: [422] Max individual file size for LFS files: 46.6GB                                                                      
error: failed to push some refs to 'https://huggingface.co/datasets/openclimatefix/mrms'
(dgmr) [jacob@ocf mrms]$ git push
[0f7bef0d818fe9c05f7c821bf4b66f9218a4bae1a1ad2ae6274288f687704f28] Max individual file size for LFS files: 46.6GB: [422] Max individual file size for LFS files: 46.6GB                                                                      
Uploading LFS objects: 100% (1/1), 125 GB | 0 B/s, done.
error: failed to push some refs to 'https://huggingface.co/datasets/openclimatefix/mrms'

Hi! I'd suggest splitting your files into smaller ones.

It is simpler for many systems to handle files that are around 1-2GB each. It makes it easier to parallelize data transfer and data processing without running into memory issues.
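For example, with xarray you could write each month of a yearly store out separately before zipping and uploading. A minimal sketch, where the paths, the "time" dimension name, and the chunk size are assumptions rather than the actual openclimatefix/mrms layout:

import xarray as xr

# Minimal sketch: split one yearly Zarr store into monthly stores.
# "mrms_2018.zarr", the "time" dimension and the chunk size of 24 are
# assumptions about the data layout.
ds = xr.open_zarr("mrms_2018.zarr")  # lazy open, nothing loaded into memory yet

for month, monthly in ds.groupby("time.month"):
    # Rechunk so each monthly store has uniform dask chunks before writing.
    monthly.chunk({"time": 24}).to_zarr(f"mrms_2018_{month:02d}.zarr", mode="w")

Each monthly store can then be zipped and pushed as its own LFS file, keeping every file well under the per-file limit.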

Okay yeah, I'll split them into smaller ones then. I was hoping to keep the large files, since xarray and zarr are built for accessing larger-than-memory datasets lazily, and they work a bit more efficiently if they don't have to read multiple files' metadata. But that is a small thing, so I'll keep the files a bit smaller. Thanks!
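For reference, the split stores can still be opened back lazily as one dataset. A rough sketch, assuming monthly stores named like the hypothetical ones above and a shared "time" dimension:

import glob
import xarray as xr

# Rough sketch: open each monthly store lazily and concatenate along time.
# The glob pattern matches the hypothetical monthly naming used above.
paths = sorted(glob.glob("mrms_2018_*.zarr"))
ds = xr.concat([xr.open_zarr(p) for p in paths], dim="time")
# Still lazy: data is only read when you select or compute on it.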