I’m trying to upload my first dataset. It’s only ~75,000 files and about 1 GB, but I immediately got 429 errors.
It is possible to mitigate this on a Pro or Enterprise plan, but it may be quicker to reduce the number of requests.
For example, uploading with upload_folder instead of upload_file will result in fewer requests.
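For reference, here is a minimal Python sketch of the upload_folder approach (the repo id and local path are placeholders):

from huggingface_hub import HfApi

api = HfApi()
# One call uploads the whole folder and creates a single commit,
# instead of one request per file as with upload_file.
api.upload_folder(
    repo_id="username/repository",     # placeholder
    folder_path="/path/to/dataset",    # placeholder
    repo_type="dataset",
)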
I was using huggingface-cli upload-large-folder.
Is upload_folder better?
upload-large-folder is still under development and is intended for cases where the size is truly large, so I think upload_folder is better if the total size is within 50 GB.
Does that mean:
huggingface-cli upload_folder username/repository /path/to/dataset --repo-type=dataset --num-workers=12
I’m getting errors saying that both upload_folder and upload-folder are invalid arguments.
valid choices {download,upload,repo-files,env,login,whoami,logout,auth,repo,lfs-enable-largefiles,lfs-multipart-upload,scan-cache,delete-cache,tag,version,upload-large-folder}
Thanks.
It works with
huggingface-cli upload mysocratesnote/jfk-files-text ~/Desktop/extracted_text/releases --repo-type=dataset
But it’s recommending I do it another way:
Consider using hf_transfer for faster uploads. This solution comes with some limitations. See Environment variables for more details.
It seems you are trying to upload a large folder at once. This might take some time and then fail if the folder is too large. For such cases, it is recommended to upload in smaller batches or to use HfApi().upload_large_folder(...) / huggingface-cli upload-large-folder instead. For more details, check out Upload files to the Hub.
Start hashing 73480 files.
Finished hashing 73480 files.
This failed shortly after it started with the ‘upload’ option.
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 409, in hf_raise_for_status
response.raise_for_status()
File "/Users/user/miniforge3/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/datasets/mysocratesnote/jfk-files-text/commit/main
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/user/miniforge3/bin/huggingface-cli", line 8, in <module>
sys.exit(main())
^^^^^^
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/commands/huggingface_cli.py", line 57, in main
service.run()
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/commands/upload.py", line 206, in run
print(self._upload())
^^^^^^^^^^^^^^
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/commands/upload.py", line 301, in _upload
return self.api.upload_folder(
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/hf_api.py", line 1624, in _inner
return fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/hf_api.py", line 4934, in upload_folder
commit_info = self.create_commit(
^^^^^^^^^^^^^^^^^^^
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/hf_api.py", line 1624, in _inner
return fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/hf_api.py", line 4285, in create_commit
hf_raise_for_status(commit_resp, endpoint_name="commit")
File "/Users/user/miniforge3/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 482, in hf_raise_for_status
raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/datasets/mysocratesnote/jfk-files-text/commit/main
It works with upload-large-folder for a little while, but even with --num-workers=2 it quickly hits a rate limit again.
Is there no way to upload and specify a rate under the limit?
That’s strange. I don’t think the numbers are high enough to cause an error… @Wauplin
No, there is no hardcoded server-side limit, but there might be some technical issues. We are working on Fix dynamic commit size by maximizemaxwell · Pull Request #3016 · huggingface/huggingface_hub · GitHub to allow dynamic commit sizes, which should help mitigate the issue.
Thank you!
Not sure I understand all of this. Is there a way to upload at the “right speed” to avoid getting blocked? I’m on my third or fourth attempt. I can only upload a few thousand files without getting blocked. Even with a single worker, the rate limit keeps triggering. This is not a huge archive; it’s just a little over 1 GB. It’s taken about 4 hours just to upload about a third of it.
The problem is not the total size but the number of files (around 70k+ in total?). upload-large-folder was not meant for that at first (my bad, I designed it mostly for uploading folders with hundreds of large files rather than folders with tens of thousands of small files). The result is that we are committing them in chunks of 50, which produces hundreds of commits and triggers the rate limit. Fix dynamic commit size by maximizemaxwell · Pull Request #3016 · huggingface/huggingface_hub · GitHub is meant as a good workaround for that, but it’s not finished yet.
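To put numbers on it: roughly 73,480 files committed 50 at a time works out to on the order of 1,470 commits for a single upload, which is far more than the commit rate limit is meant to absorb.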
In the meantime I don’t have many suggestions except making the upload more manual (i.e. running huggingface-cli upload on subparts of the repo), for example as sketched below.
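Something along these lines, where SUBFOLDER is a placeholder for one part of the local folder (untested here; the third positional argument of huggingface-cli upload is the destination path inside the repo):

huggingface-cli upload mysocratesnote/jfk-files-text ~/Desktop/extracted_text/releases/SUBFOLDER SUBFOLDER --repo-type=dataset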
This problem may be exacerbated by the fact that huggingface-cli keeps trying over and over after it already hit a 429 error. Ideally it would quit after getting that error a couple of times.
Oooh, I did not notice that the files are being uploaded as regular markdown files. This means that all the data is stored in the git history instead of as LFS files stored on S3. This is most certainly the culprit. Usually we try to avoid storing data “raw” like this, as it makes everything very slow. This is why git+LFS (and now git+xet) was developed.
If that doesn’t make any sense to you, it basically means that the way files are stored on the repo is not optimized. I would recommend:
- create a new separate repo
- make sure the .md files are tracked as LFS (can be done by modifying the .gitattributes file: .gitattributes · mysocratesnote/jfk-files-text at main; see the sketch after this list)
- upload files subpart by subpart (around 250 by 250 is good)
- once everything is uploaded, delete the original repo and move the new one under the previous namespace
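A rough Python sketch of the LFS-tracking and batched-upload steps (the new repo name is a placeholder, the glob assumes the files are .md, and the pause between commits is a guessed value to stay clear of the rate limit):

import time
from pathlib import Path
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
repo_id = "mysocratesnote/jfk-files-text-new"    # placeholder name for the new repo
local_root = Path.home() / "Desktop/extracted_text/releases"

api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

# Track .md files with LFS before uploading any data.
# Note: this replaces the repo's auto-generated .gitattributes; append to it
# instead if you want to keep the default rules as well.
api.create_commit(
    repo_id=repo_id,
    repo_type="dataset",
    operations=[
        CommitOperationAdd(
            path_in_repo=".gitattributes",
            path_or_fileobj=b"*.md filter=lfs diff=lfs merge=lfs -text\n",
        )
    ],
    commit_message="Track .md files with LFS",
)

# Upload in batches of ~250 files, one commit per batch.
files = sorted(local_root.rglob("*.md"))
batch_size = 250
for i in range(0, len(files), batch_size):
    batch = files[i : i + batch_size]
    api.create_commit(
        repo_id=repo_id,
        repo_type="dataset",
        operations=[
            CommitOperationAdd(
                path_in_repo=str(f.relative_to(local_root)),
                path_or_fileobj=str(f),
            )
            for f in batch
        ],
        commit_message=f"Upload files {i + 1}-{i + len(batch)}",
    )
    time.sleep(5)    # arbitrary pause between commits; increase it if 429s still appear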
Very sorry about this situation but I think starting with a clean state is really needed here.
It seems like it failed even faster when using ‘upload’ rather than ‘upload-large-folder’.
When you say run it on parts of the repo one by one… OK but how do I ensure it’s uploaded to the right path?
If the repository looks like this:
├── 2017/ # 2017 release
│ ├── part_1/ # 2017 part 1
│ ├── part_2/ # 2017 part 2
│ ├── part_3/ # 2017 part 3
│ ├── part_4/ # 2017 part 4
│ └── part_5/ # 2017 part 5 (originally labeled "additional")
├── 2018/ # 2018 release
│ ├── part_1/ # 2018 part 1
│ └── part_2/ # 2018 part 2
Do I use huggingface-cli like this if I want to start with the 2017 subfolder?
huggingface-cli upload mysocratesnote/jfk-files-text/2017 ~/Desktop/extracted_text/releases/2017 --repo-type=dataset
Thanks.
Another (better) solution is to store the data in a format that does not require uploading each file individually. Typically, this could be .parquet files with columns like “date”, “filename”, and “content”, where each row is a markdown file. This way you will have only a few .parquet files to upload, which will solve all of your problems. Also, it will enable the Dataset Studio for your repo.
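A minimal sketch of that conversion, assuming the layout shown earlier, that the extracted files are .md, and that pandas with pyarrow is installed (the release year stands in for the suggested “date” column):

from pathlib import Path
import pandas as pd    # requires pyarrow (or fastparquet) for to_parquet

root = Path.home() / "Desktop/extracted_text/releases"
rows = []
for f in sorted(root.rglob("*.md")):    # adjust the glob if the extension differs
    rel = f.relative_to(root)
    rows.append({
        "release": rel.parts[0],        # e.g. "2017"
        "filename": str(rel),
        "content": f.read_text(errors="replace"),
    })

pd.DataFrame(rows).to_parquet("jfk_files_text.parquet", index=False)

The resulting file (or one file per release) can then be pushed with a single huggingface-cli upload call instead of tens of thousands of per-file requests.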
I think I figured that out. Thanks.