HTTP 504: Gateway timeout error when pushing dataset

I am trying to push a large dataset with the help of

dataset.push_to_hub()

While pushing, it raises an HTTP 504 error:

requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/datasets/name/repo-name/upload/main/data/train2-00041-of-00064.parquet

How can I avoid this when pushing large datasets?

Hi! As discussed on GitHub, feel free to try again (the server might have had some issues).

We’re also adding a retry mechanism to work around 504 errors: Retry HfApi call inside push_to_hub when 504 error by albertvillanova · Pull Request #3886 · huggingface/datasets · GitHub

We’ll do a new release of datasets soon to include this :slight_smile:
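In the meantime, a minimal manual retry around push_to_hub might look like this (a sketch only; the repo name and backoff values are placeholders, not something from this thread):

import time
from requests.exceptions import HTTPError

# `dataset` is the Dataset / DatasetDict you are pushing.
for attempt in range(5):
    try:
        dataset.push_to_hub("username/repo-name")  # placeholder repo id
        break
    except HTTPError as err:
        # Retry only on 504 Gateway Time-out; anything else is re-raised.
        if err.response is not None and err.response.status_code == 504 and attempt < 4:
            time.sleep(10 * 2 ** attempt)  # simple exponential backoff
        else:
            raise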

1 Like

I’ve been getting this today (200 GB dataset, pushed from Python); just raising that this is still a problem (and it’d be so great to be able to resume pushes, since they can take a really long time).

Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████| 48/48 [00:00<00:00, 329.12ba/s]
Uploading the dataset shards:  46%|██████████████████████▊                                 | 192/420 [6:08:02<7:17:03, 115.01s/it]
Traceback (most recent call last):
File "/Users/dmackparty/dev/presto-py/app/pipelines/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 409, in hf_raise_for_status
response.raise_for_status()
File "/Users/dmackparty/dev/presto-py/app/pipelines/.venv/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/complete_multipart?uploadId=ak3yajyFaNpfIpmsLljywyUVb1sNay5D5GUGKAjJcM1h2GjHSHWLUuXNQ2.rxlPK9Mydu5w.5iCJ5P9SRIw4tbO2Gk9bXPRqPshRZlskHKa.tpGImodDyUjcU1yH92La&bucket=hf-hub-lfs-us-east-1&prefix=repos%2Fbe%2Feb%2Fbeebea985423ba0ccb6c7ef0c6925225a4dcd28c85e8c4db44607f457a00ce2d&expiration=Tue%2C+04+Mar+2025+06%3A40%3A19+GMT&signature=3fe1b5f59f4c97236b73dc288be78fd3d1802a8fe42fc10db5d3342a8854fd50

1 Like

Possibly a .push_to_hub() issue? @lhoestq

I’m considering writing code to

  1. Build and save the large dataset to a local HDD.
  2. Use this to upload it without the failures: Upload files to the Hub (rough sketch below).

Good idea?
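Roughly what I have in mind, using huggingface_hub’s upload_large_folder for step 2 (the local path, shard naming, and repo id below are just placeholders):

import os
from huggingface_hub import HfApi

# Step 1: write the dataset to a local folder as parquet shards
# (one file per split here, just to keep the sketch short).
# `dataset_dict` is the DatasetDict being built.
local_dir = "local_dataset"
os.makedirs(f"{local_dir}/data", exist_ok=True)
for split, ds in dataset_dict.items():
    ds.to_parquet(f"{local_dir}/data/{split}-00000-of-00001.parquet")

# Step 2: upload the folder with huggingface_hub. upload_large_folder is
# resumable: re-running it after a failure skips what was already uploaded.
api = HfApi()
api.upload_large_folder(
    repo_id="my-org/my-dataset",  # placeholder
    repo_type="dataset",
    folder_path=local_dir,
)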

1 Like

(note I’m calling push_to_hub on a DatasetDict, to an enterprise organization)

1 Like

Good idea?

Yea. In that case, you could use this.

cc @Wauplin maybe a transient 504 issue?

Anyway @davidhhmack, making push_to_hub easier to resume is definitely something we’ll look into!

1 Like

Thanks, I appreciate it! It’s crushing to get hours into an upload and then have it die.

2 Likes