I am trying to push a large dataset using
dataset.push_to_hub()
While pushing, it fails with an HTTP 504 error:
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/datasets/name/repo-name/upload/main/data/train2-00041-of-00064.parquet
How can I avoid this when pushing large datasets?
Hi! As discussed on GitHub, feel free to try again (the server might have had some issues).
We’re also adding a retry mechanism to work around 504 errors: Retry HfApi call inside push_to_hub when 504 error by albertvillanova · Pull Request #3886 · huggingface/datasets · GitHub
We’ll do a new release of datasets soon to include this.
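Until a release with the built-in retry is installed, the same idea can be approximated on the caller's side. The following is a minimal sketch, not the implementation from the pull request: `call_with_retry` and `is_gateway_timeout` are hypothetical helper names, and the exponential backoff schedule is an assumption. It retries only when the raised exception carries a response with status code 504, which is how `requests.exceptions.HTTPError` exposes it.

```python
import time


def is_gateway_timeout(exc):
    """Return True if the exception carries an HTTP response with status 504."""
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None) == 504


def call_with_retry(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a 504 error, wait and call it again (hypothetical helper).

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    Any other exception, or a 504 after max_retries retries, is re-raised.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            if not is_gateway_timeout(exc) or attempt >= max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1


# Usage sketch (assumes `dataset` is a datasets.Dataset and the repo id is yours):
# call_with_retry(lambda: dataset.push_to_hub("name/repo-name"), base_delay=30.0)
```

Note that this simply calls `push_to_hub` again from the start. Whether a repeated call skips shards that were already uploaded depends on the library version, so treat it as a workaround for transient errors rather than a resume mechanism.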
I’ve been getting this today (200 GB dataset, pushing from Python). I’d just like to raise that this is still a problem. It would also be great to be able to resume pushes, since they can take a really long time.
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████| 48/48 [00:00<00:00, 329.12ba/s]
Uploading the dataset shards: 46%|██████████████████████▊ | 192/420 [6:08:02<7:17:03, 115.01s/it]
Traceback (most recent call last):
  File "/Users/dmackparty/dev/presto-py/app/pipelines/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 409, in hf_raise_for_status
    response.raise_for_status()
  File "/Users/dmackparty/dev/presto-py/app/pipelines/.venv/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/complete_multipart?uploadId=ak3yajyFaNpfIpmsLljywyUVb1sNay5D5GUGKAjJcM1h2GjHSHWLUuXNQ2.rxlPK9Mydu5w.5iCJ5P9SRIw4tbO2Gk9bXPRqPshRZlskHKa.tpGImodDyUjcU1yH92La&bucket=hf-hub-lfs-us-east-1&prefix=repos%2Fbe%2Feb%2Fbeebea985423ba0ccb6c7ef0c6925225a4dcd28c85e8c4db44607f457a00ce2d&expiration=Tue%2C+04+Mar+2025+06%3A40%3A19+GMT&signature=3fe1b5f59f4c97236b73dc288be78fd3d1802a8fe42fc10db5d3342a8854fd50
Possibly a .push_to_hub() issue? @lhoestq
Hi
We recently bought an enterprise subscription in order to increase our data usage limits.
However, it is no longer possible for me to push large datasets to the hub (500+ GB), as it eventually crashes with a “504 Server Error: Gateway Time-out for url: https://huggingface.co/api/datasets/{organization}/{dataset}/preupload/main ”
I don’t know if this has anything to do with your recent change in data usage limits policy. I see that you write that one has to contact you in order to upload d…
I’m considering writing code to:
build and save the large dataset to a local disk
then try this to upload it without the failures: Upload files to the Hub
Good idea?
(Note: I’m calling push_to_hub on a DatasetDict, pushing to an enterprise organization.)
Good idea?
Yea. In that case, you could use this.
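For reference, the two-step plan above (write the dataset to local disk, then upload the folder) could look roughly like this. It is a sketch under assumptions: `shard_filename`, `export_split`, and `upload_folder` are hypothetical helper names, the `data/<split>-NNNNN-of-NNNNN.parquet` layout is modelled on the file name in the first error message, and `HfApi.upload_large_folder` is one resumable upload function in `huggingface_hub` that may or may not be the exact one the linked guide recommends for your case. It requires a reasonably recent `huggingface_hub`.

```python
import os


def shard_filename(split, index, num_shards):
    """Build a shard name such as 'train-00041-of-00064.parquet'."""
    return f"{split}-{index:05d}-of-{num_shards:05d}.parquet"


def export_split(dataset, out_dir, split, num_shards):
    """Write one split as Parquet shards under <out_dir>/data (hypothetical helper).

    Shards that already exist are skipped, so the export can be re-run after
    a crash. Each shard is written to a .tmp file first and renamed when done,
    so a half-written file is never mistaken for a finished one.
    """
    data_dir = os.path.join(out_dir, "data")
    os.makedirs(data_dir, exist_ok=True)
    paths = []
    for index in range(num_shards):
        path = os.path.join(data_dir, shard_filename(split, index, num_shards))
        if not os.path.exists(path):
            tmp_path = path + ".tmp"
            shard = dataset.shard(num_shards=num_shards, index=index, contiguous=True)
            shard.to_parquet(tmp_path)
            os.replace(tmp_path, path)
        paths.append(path)
    return paths


def upload_folder(out_dir, repo_id):
    """Upload the exported folder to a dataset repo (hypothetical helper)."""
    # Imported lazily so the export step works without huggingface_hub installed.
    from huggingface_hub import HfApi

    HfApi().upload_large_folder(
        repo_id=repo_id,
        folder_path=out_dir,
        repo_type="dataset",
        ignore_patterns=["*.tmp"],  # leftover partial shards from a crashed export
    )


# Usage sketch for a DatasetDict (repo id, folder, and shard count are placeholders):
# for split, ds in dataset_dict.items():
#     export_split(ds, "export", split, num_shards=64)
# upload_folder("export", "organization/dataset")
```

The point of splitting the work this way is that both steps can be re-run after a failure: the export skips shards already on disk, and the upload is designed to pick up where it left off rather than restarting from zero.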
cc @Wauplin, maybe a transient 504 issue?
Anyway @davidhhmack, making push_to_hub easier to resume is definitely something we’ll look into!
Thanks, I appreciate it! It’s crushing to get hours into an upload and then have it die.