Can I Use upload_folder in Multiprocessing Mode for Large Subfolders on Hugging Face Hub?

Hello Hugging Face Community,

I am currently working with a repository structure in which each subfolder contains approximately 30 GB of data spread across around 200 .jsonl files. I am exploring efficient ways to upload these large datasets to my Hugging Face Hub repository.

I am considering using the upload_folder function from the huggingface_hub library (with hf_transfer) in a multiprocessing environment, where each process would handle the upload of one subfolder independently. Here is an overview of my repository structure for clarity:

repository-name/
├── subfolder1/
│   ├── file1.jsonl
│   ├── file2.jsonl
│   ... (around 200 files)
├── subfolder2/
│   ├── file1.jsonl
│   ├── file2.jsonl
│   ... (around 200 files)
├── ...

My Questions:

  1. Is it feasible to use upload_folder in a multiprocessing setup without running into issues like race conditions or API rate limits?
  2. If multiprocessing is possible, could you provide guidance or examples on how to implement this efficiently?
  3. Are there any best practices or alternative methods recommended for uploading large datasets to the Hugging Face Hub?

Any insights or experiences with similar tasks would be greatly appreciated!

Thank you in advance for your help.

cc @Wauplin

Hi @nicofirst1, we currently don't have a robust way to upload such a large dataset with multiprocessing + retry. The current best approach is to use this script I wrote a few weeks ago: robust_upload.py · SPRIGHT-T2I/spright at main. What it does is upload files in chunks of ~50 files, which avoids running into rate limits. It also pre-uploads files on the fly and uses a local directory to keep track of what has been hashed/uploaded/committed, so that you don't lose your progress if the upload fails partway through. Finally, it is robust enough to be run from multiple sessions at once.
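
For reference, the core pattern the script follows looks roughly like this. This is a simplified sketch only: the repo id, paths, and chunk size are placeholders, and the local progress-tracking files that make the real script resumable are omitted.

```python
from pathlib import Path

from huggingface_hub import CommitOperationAdd, HfApi

REPO_ID = "username/repository-name"  # placeholder repo id
CHUNK_SIZE = 50  # ~50 files per commit to stay clear of rate limits

api = HfApi()
files = sorted(Path("repository-name/subfolder1").glob("*.jsonl"))  # placeholder path

for start in range(0, len(files), CHUNK_SIZE):
    chunk = files[start : start + CHUNK_SIZE]
    operations = [
        CommitOperationAdd(path_in_repo=f"subfolder1/{f.name}", path_or_fileobj=str(f))
        for f in chunk
    ]
    # Pre-upload the LFS payloads for this chunk, then commit them in one go.
    api.preupload_lfs_files(REPO_ID, additions=operations, repo_type="dataset")
    api.create_commit(
        repo_id=REPO_ID,
        repo_type="dataset",
        operations=operations,
        commit_message=f"Upload files {start}-{start + len(chunk) - 1} of subfolder1",
    )
```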

That being said, please remember that hf_transfer is already designed to use all of the available bandwidth (and all available CPU cores) when uploading large files. This usually means that running multiple processes in parallel may not improve the overall upload speed and can even decrease it. Hope this helps!

Hello Wauplin,

Thanks for the quick reply and the details about the script. It does seem well suited to large datasets. How do its speed and robustness compare to the standard upload_folder method from huggingface_hub, which I had assumed already handled large uploads effectively?