Unable to Train for a Long Time

Hi all,

I really love the datasets package! One issue I've frequently run into recently is when using a dataset in streaming mode. In one case, I get throttled, with an error similar to the message below:

huggingface_hub.utils._errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/zpn/uniref50

which seems odd since I'm only training on a single node with 8 GPUs.
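For reference, my loading code is basically the standard streaming setup (a minimal sketch; the split and the rest of my pipeline are assumed/omitted):

```python
from datasets import load_dataset

# Stream the dataset over HTTP instead of downloading it up front;
# the "train" split is assumed here and the actual training loop is omitted.
dataset = load_dataset("zpn/uniref50", split="train", streaming=True)

for example in dataset:
    ...  # feed examples to the model
```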

The other case is that after about 10 or 15 hours, I seem to get disconnected and the file is then downloaded incorrectly, similar to:

Got disconnected from remote data host. Retrying in 5sec [2/20]
Got disconnected from remote data host. Retrying in 5sec [3/20]
Got disconnected from remote data host. Retrying in 5sec [4/20]
Got disconnected from remote data host. Retrying in 5sec [1/20]
Got disconnected from remote data host. Retrying in 5sec [2/20]
Failed to read file 'zstd://shard_00000.jsonl::https://huggingface.co/datasets/gonzalobenegas/mammalian-genomes-cds/resolve/4cc9cfe7b5377c3dda6040345a9a6a5546e7e162/data/validation/shard_00000.jsonl.zst' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column() changed from object to string in row 0
1545it [13:44,  1.87it/s]

This happens after many cycles through this particular dataset, so it seems like the file is getting partially downloaded and then read, which raises the error.

Are there any workarounds for this? Training these models is costly, especially when having to restart.
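For what it's worth, I'm on the default retry settings. If the streaming retry knobs are exposed the way I think they are (the names below are my guess based on the "5sec [x/20]" messages in the log), I could try raising them, though that wouldn't fix a partially downloaded shard being read:

```python
import datasets

# Guessing these config names from the "Retrying in 5sec [x/20]" messages
# (20 retries, 5 s apart look like the defaults) -- double-check against
# the datasets version you're running.
datasets.config.STREAMING_READ_MAX_RETRIES = 60     # default: 20
datasets.config.STREAMING_READ_RETRY_INTERVAL = 10  # default: 5 seconds
```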

Hi! I opened a PR a while ago to ignore bad chunks of JSON data if there is an issue while streaming, but never got it merged: Add error_bad_chunk to the JSON loader by lhoestq · Pull Request #2838 · huggingface/datasets · GitHub. Let me know if something like this would help.
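As proposed in that PR, the option would be passed through load_dataset roughly like this (the PR isn't merged, so this is hypothetical and the exact name/semantics may differ):

```python
from datasets import load_dataset

# Hypothetical usage based on the unmerged PR above -- not available on the
# main branch; the kwarg name and behavior are assumed from the PR title.
# The idea: skip JSON chunks that fail to parse instead of raising.
ds = load_dataset(
    "gonzalobenegas/mammalian-genomes-cds",
    split="validation",
    streaming=True,
    error_bad_chunk=False,  # assumed: don't error on a corrupted chunk
)
```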

Regarding rate limiting, please make sure you don’t have another shell or someone on the same network as you spamming Hugging Face ^^’

Hm, it's odd that I hit rate limiting in a single-node scenario. Is there a hard limit to the number of requests per second?

And yeah, something like this would definitely be helpful. Is it possible to retry/remove a download in case it's corrupted?

If the jobs keep failing, I'll unfortunately have to move off of datasets and reimplement everything to read from S3 or something like that.

Is there a hard limit to the number of requests per second?

We faced some spam from bots recently, so we had to hardcode a limit (at least for free users).

And yeah, something like this would definitely be helpful. Is it possible to retry/remove a download in case it's corrupted?

In the PR I linked, it skips the corrupted batch, but we could probably extend it to allow retrying instead by seeking to the latest read location.
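In the meantime, a rough user-side workaround (just a sketch, not what the PR does) is to count how many examples you've consumed and, when a read error occurs, recreate the streaming dataset and skip past what you've already seen:

```python
from datasets import load_dataset

def stream_with_resume(name, split="train", max_restarts=10):
    """Iterate a streaming dataset, restarting and skipping already-seen
    examples if a read error (e.g. a corrupted chunk) interrupts the loop."""
    seen = 0
    for _ in range(max_restarts):
        ds = load_dataset(name, split=split, streaming=True)
        try:
            for example in ds.skip(seen):
                seen += 1
                yield example
            return  # reached the end of the stream cleanly
        except Exception:
            # Recreate the stream on the next loop iteration; `seen`
            # ensures we don't repeat examples we already yielded.
            continue
    raise RuntimeError(f"Gave up after {max_restarts} restarts")
```

Skipping is done by re-streaming from the start, so it wastes some bandwidth, but it avoids restarting the whole training job.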

That would also be great!