Can I Use upload_folder in Multiprocessing Mode for Large Subfolders on Hugging Face Hub?

Hello Hugging Face Community,

I am currently working with a repository structure in which each subfolder contains approximately 30 GB of data spread across around 200 .jsonl files. I am exploring efficient ways to upload these large datasets to my Hugging Face Hub repository.

I am considering using the upload_folder function from the huggingface_hub library (with hf_transfer) in a multiprocessing environment, where each process would handle the upload of one subfolder independently. Here is an overview of my repository structure for clarity:

repository-name/
├── subfolder1/
│   ├── file1.jsonl
│   ├── file2.jsonl
│   ... (around 200 files)
├── subfolder2/
│   ├── file1.jsonl
│   ├── file2.jsonl
│   ... (around 200 files)
├── ...

My Questions:

  1. Is it feasible to use upload_folder in a multiprocessing setup without running into issues like race conditions or API rate limits?
  2. If multiprocessing is possible, could you provide guidance or examples on how to implement this efficiently?
  3. Are there any best practices or alternative methods recommended for uploading large datasets to the Hugging Face Hub?

Any insights or experiences with similar tasks would be greatly appreciated!

Thank you in advance for your help.

cc @Wauplin

Hi @nicofirst1, we currently don't have a robust way to upload such a large dataset with multiprocessing + retry. The current best approach is to use this script I wrote a few weeks ago: robust_upload.py · SPRIGHT-T2I/spright at main. What it does is upload files in chunks of ~50 files, which avoids running into rate limits. It also pre-uploads files on the fly and uses a local directory to keep track of what has been hashed/uploaded/committed, so that you don't lose your progress if the upload fails partway through. Finally, it is robust enough to be run from multiple sessions at once.
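
For reference, the core pattern the script follows looks roughly like this. This is a simplified sketch only: the repo id, paths, and chunk size are placeholders, and the local progress-tracking files that make the real script resumable are omitted.

```python
from pathlib import Path

from huggingface_hub import CommitOperationAdd, HfApi

REPO_ID = "username/repository-name"  # placeholder repo id
CHUNK_SIZE = 50  # ~50 files per commit to stay clear of rate limits

api = HfApi()
files = sorted(Path("repository-name/subfolder1").glob("*.jsonl"))  # placeholder path

for start in range(0, len(files), CHUNK_SIZE):
    chunk = files[start : start + CHUNK_SIZE]
    operations = [
        CommitOperationAdd(path_in_repo=f"subfolder1/{f.name}", path_or_fileobj=str(f))
        for f in chunk
    ]
    # Pre-upload the LFS payloads for this chunk, then commit them in one go.
    api.preupload_lfs_files(REPO_ID, additions=operations, repo_type="dataset")
    api.create_commit(
        repo_id=REPO_ID,
        repo_type="dataset",
        operations=operations,
        commit_message=f"Upload files {start}-{start + len(chunk) - 1} of subfolder1",
    )
```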

That being said, please remember that hf_transfer is already designed to use all of the available bandwidth (and all available CPU cores) when uploading large files. This usually means that running multiple processes in parallel may not improve the overall upload speed and can even decrease it. Hope this helps!

Hello Wauplin,

Thanks for the quick reply and the details about the script. It does seem well suited to large datasets. How do its speed and robustness compare to the standard upload_folder method from huggingface_hub, which I had assumed already handled large uploads effectively?