I am trying to push a large dataset using
dataset.push_to_hub()
While pushing, it fails with an HTTP 504 error:
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/datasets/name/repo-name/upload/main/data/train2-00041-of-00064.parquet
How can I avoid this when pushing large datasets?
Hi ! As discussed on GitHub, feel free to try again (the server might have had some issues).
We're also adding a retry mechanism to work around 504 errors: Retry HfApi call inside push_to_hub when 504 error by albertvillanova · Pull Request #3886 · huggingface/datasets · GitHub
We'll do a new release of datasets soon to include this.
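Until a release with built-in retries is available, a caller-side retry can be sketched as below. This is only an illustration, not the code from the pull request: `is_gateway_timeout` and `call_with_retries` are hypothetical helper names, and the attempt count and delays are arbitrary assumptions. It also does not assume that a repeated push_to_hub call reuses shards that were already uploaded; that depends on the library version.

```python
import random
import time


def is_gateway_timeout(exc: BaseException) -> bool:
    """Return True if exc carries an HTTP 504 response.

    requests.exceptions.HTTPError exposes the response as `exc.response`,
    so duck-typing on that attribute avoids importing requests here.
    """
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None) == 504


def call_with_retries(func, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call func(); on a 504, back off exponentially and try again.

    Any other exception, or a 504 on the final attempt, is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if not is_gateway_timeout(exc) or attempt == max_attempts:
                raise
            # Delays of roughly 2s, 4s, 8s, ... plus jitter (values are assumptions).
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            sleep(delay)


# Hypothetical usage, with `dataset` being the object from the question:
# call_with_retries(lambda: dataset.push_to_hub("name/repo-name"))
```

The `sleep` parameter exists only so the wait can be replaced in tests; in normal use the default `time.sleep` applies.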
I've been getting this today (200GB dataset, pushed from Python). I'd just like to raise that this is still a problem (and it'd be great to be able to resume pushes, since they can take a really long time).
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████| 48/48 [00:00<00:00, 329.12ba/s]
Uploading the dataset shards: 46%|███████████████████████ | 192/420 [6:08:02<7:17:03, 115.01s/it]
Traceback (most recent call last):
  File "/Users/dmackparty/dev/presto-py/app/pipelines/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 409, in hf_raise_for_status
    response.raise_for_status()
  File "/Users/dmackparty/dev/presto-py/app/pipelines/.venv/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/complete_multipart?uploadId=ak3yajyFaNpfIpmsLljywyUVb1sNay5D5GUGKAjJcM1h2GjHSHWLUuXNQ2.rxlPK9Mydu5w.5iCJ5P9SRIw4tbO2Gk9bXPRqPshRZlskHKa.tpGImodDyUjcU1yH92La&bucket=hf-hub-lfs-us-east-1&prefix=repos%2Fbe%2Feb%2Fbeebea985423ba0ccb6c7ef0c6925225a4dcd28c85e8c4db44607f457a00ce2d&expiration=Tue%2C+04+Mar+2025+06%3A40%3A19+GMT&signature=3fe1b5f59f4c97236b73dc288be78fd3d1802a8fe42fc10db5d3342a8854fd50
Possibly .push_to_hub() issue? @lhoestq
Hi
We recently bought an enterprise subscription in order to increase our data usage limits.
However, it is no longer possible for me to push large datasets to the hub (500+ GB), as it eventually crashes with a "504 Server Error: Gateway Time-out for url: https://huggingface.co/api/datasets/{organization}/{dataset}/preupload/main"
I don't know if this has anything to do with your recent change in data usage limits policy. I see that you write that one has to contact you in order to upload d…
I'm considering writing code to:
build and save the large dataset to a local HDD
try this to upload it without the failures: Upload files to the Hub
Good idea?
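A rough sketch of that two-step plan, under several assumptions: the helper names (`shard_filename`, `export_to_parquet`, `upload_folder_resumable`) and the shard count are made up for illustration, the `data/{split}-XXXXX-of-YYYYY.parquet` layout simply mirrors the file names seen in the error above, and the upload step relies on `HfApi.upload_large_folder` from a recent huggingface_hub release, which is designed to be re-run after an interruption.

```python
from pathlib import Path


def shard_filename(split: str, index: int, num_shards: int) -> str:
    """Build a name like 'train-00041-of-00064.parquet' (naming is an assumption)."""
    return f"{split}-{index:05d}-of-{num_shards:05d}.parquet"


def export_to_parquet(dataset_dict, out_dir, num_shards=64):
    """Write every split of a DatasetDict-like mapping to local Parquet shards.

    Shards that already exist are skipped, so the export can be re-run
    after a crash without redoing finished work.
    """
    data_dir = Path(out_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    for split, ds in dataset_dict.items():
        for index in range(num_shards):
            path = data_dir / shard_filename(split, index, num_shards)
            if path.exists():
                continue
            shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
            # Write to a temporary name first so a partial file is never
            # mistaken for a finished shard on the next run.
            tmp = path.with_name(path.name + ".tmp")
            shard.to_parquet(str(tmp))
            tmp.replace(path)
    return data_dir


def upload_folder_resumable(out_dir, repo_id):
    """Upload the exported folder to a dataset repo on the Hub."""
    # Imported lazily so the export helpers work without huggingface_hub installed.
    from huggingface_hub import HfApi

    HfApi().upload_large_folder(
        repo_id=repo_id,
        folder_path=str(out_dir),
        repo_type="dataset",
        allow_patterns=["*.parquet"],  # skip any leftover .tmp files
    )


# Hypothetical usage, with `dataset_dict` being the DatasetDict to publish:
# export_to_parquet(dataset_dict, "export")
# upload_folder_resumable("export", "organization/dataset")
```

If the upload dies partway through, calling `upload_folder_resumable` again on the same folder should continue rather than start over, since the local shards are already on disk.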
(note I'm calling push_to_hub on a DatasetDict, to an enterprise organization)
Good idea?
Yea. In that case, you could use this.
cc @Wauplin maybe a transient 504 issue ?
Anyway @davidhhmack making push_to_hub easier to resume is definitely something we'll look into !
Thanks, I appreciate it! It's crushing to get hours into an upload, then have it die.