Limit of 100000 files

Hello!

I am trying to upload a dataset with 231685 audio files, but apparently there is a limit on the number of files.

remote: Your push was rejected because it contains too many files.
remote: Your git repo would contain 231687 files after this push, over
remote: the limit of 100000 files.

What is the best practice for such datasets? Should I create an archive?

Hi @ccoreilly,

Grouping files into archives could indeed be a solution, but there may be approaches better suited to this case. Pinging @lhoestq @mariosasko @albertvillanova as datasets experts :slight_smile:

I’d highly recommend using archives. Many filesystems don’t handle that many files in a single directory very well. Feel free to use TAR archives, for example (ZIP compresses by default, which is not ideal for already-compressed audio files).

And if you have many archives, it also allows people to download a subset of the dataset easily, and it’s also a nice bonus for parallel downloading and processing :wink:
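As a rough sketch of the sharding idea above (not an official recipe), something like the following could pack the audio files into several uncompressed TAR archives using only the Python standard library. The function name `shard_into_tars`, the `*.wav` pattern, the `shard-00000.tar` naming, and the default of 10,000 files per archive are all assumptions made for illustration; adjust them to your data.

```python
import tarfile
from pathlib import Path


def shard_into_tars(src_dir, out_dir, files_per_shard=10_000, pattern="*.wav"):
    """Pack files under src_dir into uncompressed TAR shards in out_dir.

    Hypothetical helper: names, pattern, and shard size are illustrative.
    Returns the list of archive paths that were written.
    """
    src = Path(src_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Sort so that the shard contents are reproducible between runs.
    files = sorted(p for p in src.rglob(pattern) if p.is_file())

    shard_paths = []
    for start in range(0, len(files), files_per_shard):
        shard_path = out / f"shard-{start // files_per_shard:05d}.tar"
        # Mode "w" writes a plain TAR without compression, which suits
        # audio that is already compressed.
        with tarfile.open(shard_path, "w") as tar:
            for f in files[start:start + files_per_shard]:
                # Store paths relative to src_dir inside the archive.
                tar.add(f, arcname=f.relative_to(src).as_posix())
        shard_paths.append(shard_path)
    return shard_paths
```

With 231685 files and 10,000 files per archive this would produce 24 archives, well under the 100000-file limit. The shard size is a free choice: fewer, larger archives mean fewer files in the repo, while more, smaller ones make partial and parallel downloads easier.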
