Hi all,
I really love the datasets package! One issue I've frequently run into recently is when using a dataset in streaming mode. In one case, I get throttled, with an error similar to the one below:
```
huggingface_hub.utils._errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/zpn/uniref50
```
which seems odd, since I'm only using a single node with 8 GPUs.
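For reference, here is roughly how each process streams the dataset (a simplified sketch; the split name and loop body are placeholders for the actual training setup):

```python
from datasets import load_dataset

# Simplified sketch: each of the 8 GPU processes streams the dataset like
# this ("train" and the loop body stand in for the real training setup).
ds = load_dataset("zpn/uniref50", split="train", streaming=True)
for example in ds:
    ...  # training step
```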
The other case: after about 10 or 15 hours, I get disconnected, and then the file seems to be downloaded incorrectly, with logs similar to:
```
Got disconnected from remote data host. Retrying in 5sec [2/20]
Got disconnected from remote data host. Retrying in 5sec [3/20]
Got disconnected from remote data host. Retrying in 5sec [4/20]
Got disconnected from remote data host. Retrying in 5sec [1/20]
Got disconnected from remote data host. Retrying in 5sec [2/20]
Failed to read file 'zstd://shard_00000.jsonl::https://huggingface.co/datasets/gonzalobenegas/mammalian-genomes-cds/resolve/4cc9cfe7b5377c3dda6040345a9a6a5546e7e162/data/validation/shard_00000.jsonl.zst' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column() changed from object to string in row 0
1545it [13:44, 1.87it/s]
```
This happens after many cycles through this particular dataset, so it seems like the file is getting partially downloaded and then read anyway (which raises the parse error).
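For now, my stopgap is to catch the parse error and re-open the stream, skipping the examples already consumed. This is only a sketch: it assumes the `ArrowInvalid` from the log above propagates out of the iterator, and `skip()` has to re-stream from the beginning, so recovery is slow on a large dataset.

```python
from datasets import load_dataset
from pyarrow.lib import ArrowInvalid

def resilient_stream(name, split):
    # Stopgap sketch: if a shard fails to parse mid-epoch, re-open the
    # stream and skip the examples already consumed. Assumes the
    # ArrowInvalid propagates out of the iterator; skip() re-streams
    # from the start, so recovery is slow on large datasets.
    consumed = 0
    while True:
        ds = load_dataset(name, split=split, streaming=True).skip(consumed)
        try:
            for example in ds:
                consumed += 1
                yield example
            return  # finished the pass cleanly
        except ArrowInvalid:
            print(f"Corrupt read after {consumed} examples; re-opening stream")

for example in resilient_stream("gonzalobenegas/mammalian-genomes-cds", "validation"):
    ...  # training step
```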
Is there a more robust workaround for this? Training these models is costly, especially when a crash forces a restart.
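The only other knob I've found is raising the built-in retry settings. If I'm reading the datasets source correctly, these two module-level values drive the "Retrying in 5sec [x/20]" messages above (the values below are arbitrary):

```python
import datasets.config

# Assumed from datasets/config.py: the interval between reconnect attempts
# and the maximum number of attempts (the "5sec" and "/20" in the log above).
datasets.config.STREAMING_READ_RETRY_INTERVAL = 10  # default: 5 seconds
datasets.config.STREAMING_READ_MAX_RETRIES = 50     # default: 20 attempts
```

That only helps with transient disconnects, though; it doesn't stop a truncated file from being parsed.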