Hello Hugging Face Community,
I am currently working with a repository structure where each subfolder contains approximately 30 GB of data, split across around 200 `.jsonl` files. I am exploring efficient ways to upload these large datasets to my Hugging Face Hub repository.
I am considering using the `upload_folder` function from the `huggingface_hub` library (with `hf_transfer` enabled) in a multiprocessing setup, where each process would handle the upload of one subfolder independently (a rough sketch of what I have in mind follows the structure below). Here is an overview of my repository structure for clarity:
```
repository-name/
├── subfolder1/
│   ├── file1.jsonl
│   ├── file2.jsonl
│   └── ... (around 200 files)
├── subfolder2/
│   ├── file1.jsonl
│   ├── file2.jsonl
│   └── ... (around 200 files)
├── ...
```
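Roughly, this is the kind of approach I am considering. It is an untested sketch: the repo ID, local path, and pool size are placeholders, and I am assuming `HF_HUB_ENABLE_HF_TRANSFER=1` is set in the environment so that `hf_transfer` is actually used.

```python
import os
from concurrent.futures import ProcessPoolExecutor

from huggingface_hub import HfApi

# Placeholder values -- adjust for the actual repo and local path.
REPO_ID = "my-username/my-dataset"
LOCAL_ROOT = "repository-name"


def upload_subfolder(subfolder: str) -> str:
    """Upload a single subfolder as its own commit."""
    api = HfApi()  # each process creates its own client
    api.upload_folder(
        repo_id=REPO_ID,
        repo_type="dataset",
        folder_path=os.path.join(LOCAL_ROOT, subfolder),
        path_in_repo=subfolder,
        commit_message=f"Upload {subfolder}",
    )
    return subfolder


if __name__ == "__main__":
    subfolders = sorted(
        d for d in os.listdir(LOCAL_ROOT)
        if os.path.isdir(os.path.join(LOCAL_ROOT, d))
    )
    # One task per subfolder, capped to a small pool so I don't hammer
    # the Hub API with too many concurrent commits.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for done in pool.map(upload_subfolder, subfolders):
            print(f"Finished uploading {done}")
```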
My Questions:
- Is it feasible to use `upload_folder` in a multiprocessing setup without running into issues like race conditions or API rate limits?
- If multiprocessing is possible, could you provide guidance or examples on how to implement it efficiently?
- Are there any best practices or alternative methods recommended for uploading large datasets to the Hugging Face Hub?
Any insights or experiences with similar tasks would be greatly appreciated!
Thank you in advance for your help.