Unable to Train for a Long Time

Hi all,

I really love the datasets package! One issue I've frequently run into recently is when using a dataset in streaming mode. In one case, I get throttled, with an error similar to the message below:

huggingface_hub.utils._errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/zpn/uniref50

which seems odd since I'm only training on a single node with 8 GPUs.
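For reference, my loading code is basically the standard streaming setup (a minimal sketch; the split and the rest of my pipeline are assumed/omitted):

```python
from datasets import load_dataset

# Stream the dataset over HTTP instead of downloading it up front;
# the "train" split is assumed here and the actual training loop is omitted.
dataset = load_dataset("zpn/uniref50", split="train", streaming=True)

for example in dataset:
    ...  # feed examples to the model
```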

The other case is that after about 10 or 15 hours, I seem to get disconnected and the file is then downloaded incorrectly, similar to:

Got disconnected from remote data host. Retrying in 5sec [2/20]
Got disconnected from remote data host. Retrying in 5sec [3/20]
Got disconnected from remote data host. Retrying in 5sec [4/20]
Got disconnected from remote data host. Retrying in 5sec [1/20]
Got disconnected from remote data host. Retrying in 5sec [2/20]
Failed to read file 'zstd://shard_00000.jsonl::https://huggingface.co/datasets/gonzalobenegas/mammalian-genomes-cds/resolve/4cc9cfe7b5377c3dda6040345a9a6a5546e7e162/data/validation/shard_00000.jsonl.zst' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column() changed from object to string in row 0
1545it [13:44,  1.87it/s]

This happens after many cycles through this particular dataset, so it seems like the file is getting partially downloaded and then read, which raises the error.

Are there any workarounds for this? Training these models is costly, especially when having to restart.
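For what it's worth, I'm on the default retry settings. If the streaming retry knobs are exposed the way I think they are (the names below are my guess based on the "5sec [x/20]" messages in the log), I could try raising them, though that wouldn't fix a partially downloaded shard being read:

```python
import datasets

# Guessing these config names from the "Retrying in 5sec [x/20]" messages
# (20 retries, 5 s apart look like the defaults) -- double-check against
# the datasets version you're running.
datasets.config.STREAMING_READ_MAX_RETRIES = 60     # default: 20
datasets.config.STREAMING_READ_RETRY_INTERVAL = 10  # default: 5 seconds
```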

Hi! I opened a PR a while ago to ignore bad chunks of JSON data if there is an issue while streaming, but never got it merged: Add error_bad_chunk to the JSON loader by lhoestq · Pull Request #2838 · huggingface/datasets · GitHub. Let me know if something like this would help.
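As proposed in that PR, the option would be passed through load_dataset roughly like this (the PR isn't merged, so this is hypothetical and the exact name/semantics may differ):

```python
from datasets import load_dataset

# Hypothetical usage based on the unmerged PR above -- not available on the
# main branch; the kwarg name and behavior are assumed from the PR title.
# The idea: skip JSON chunks that fail to parse instead of raising.
ds = load_dataset(
    "gonzalobenegas/mammalian-genomes-cds",
    split="validation",
    streaming=True,
    error_bad_chunk=False,  # assumed: don't error on a corrupted chunk
)
```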

Regarding rate limiting, please make sure you don’t have another shell or someone on the same network as you spamming Hugging Face ^^’

Hm, it's odd that I hit rate limiting in a single-node scenario. Is there a hard limit to the number of requests per second?

And yeah, something like this would definitely be helpful. Is it possible to retry/remove a download in case it's corrupted?

If the jobs keep failing, I'll unfortunately have to move off of datasets and reimplement everything to read from S3 or something like that.

Is there a hard limit to the number of requests per second?

We faced some spam from bots recently, so we had to hardcode a limit (at least for free users).

And yeah, something like this would definitely be helpful. Is it possible to retry/remove a download in case it's corrupted?

In the PR I linked, it skips the corrupted batch, but we could probably extend it to allow retrying instead by seeking to the latest read location.
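In the meantime, a rough user-side workaround (just a sketch, not what the PR does) is to count how many examples you've consumed and, when a read error occurs, recreate the streaming dataset and skip past what you've already seen:

```python
from datasets import load_dataset

def stream_with_resume(name, split="train", max_restarts=10):
    """Iterate a streaming dataset, restarting and skipping already-seen
    examples if a read error (e.g. a corrupted chunk) interrupts the loop."""
    seen = 0
    for _ in range(max_restarts):
        ds = load_dataset(name, split=split, streaming=True)
        try:
            for example in ds.skip(seen):
                seen += 1
                yield example
            return  # reached the end of the stream cleanly
        except Exception:
            # Recreate the stream on the next loop iteration; `seen`
            # ensures we don't repeat examples we already yielded.
            continue
    raise RuntimeError(f"Gave up after {max_restarts} restarts")
```

Skipping is done by re-streaming from the start, so it wastes some bandwidth, but it avoids restarting the whole training job.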

That would also be great!