We are continuously querying the datasets server from around 260 instances at any given time, and we keep having connection dropouts or intermittently very slow downloads. Might there be an explanation for this? Is there anything we can do on our end to ensure stability?
Wouldn’t it be easy to create a function that catches HTTPError exceptions and retries? I don’t think there is a built-in retry function…
https://requests.readthedocs.io/en/latest/api/#requests.HTTPError
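Something like this would work as a starting point (just a rough sketch; the retry count, backoff, and the `/rows` query values are placeholders to adapt to your setup):

```python
import time
import requests

def get_with_retries(url, params=None, max_retries=5, backoff=1.0):
    """GET a URL, retrying on HTTP errors, connection errors and timeouts."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
            return response.json()
        except (requests.HTTPError, requests.ConnectionError, requests.Timeout):
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (2 ** attempt))  # exponential backoff

# example call against the /rows endpoint ("some_org/some_dataset" is a placeholder)
rows = get_with_retries(
    "https://datasets-server.huggingface.co/rows",
    params={
        "dataset": "some_org/some_dataset",
        "config": "default",
        "split": "train",
        "offset": 0,
        "length": 100,
    },
)
```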
We actually already have a retry mechanism set up. I was curious whether there are known dropouts on the datasets server that would result in applications using it not working.
Have you noticed any improvement? I am having similar issues with FineWeb datasets
No, this is an ongoing issue. It comes and goes: sometimes the connection is good and reliable, sometimes not.
Would really love to hear more opinions on this, since we are considering moving away from HF because of it.
cc @lhoestq
The dataset viewer is an API for viewing parts of the data, not for downloading all of it.
If you’re looking for a reliable download, you may be interested in downloading the data files directly, or in using tools like pyspark, dask, datasets, etc.
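For example, here is a minimal sketch with the `datasets` library, streaming records instead of paging through the `/rows` API (the repo id and split are placeholders to adjust to your dataset):

```python
from itertools import islice
from datasets import load_dataset

# Stream examples directly from the Hub instead of calling the viewer API.
# "HuggingFaceFW/fineweb" is just an example repo id.
ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for example in islice(ds, 100):
    print(example)
```

Alternatively, you can fetch the underlying data files themselves (e.g. with `huggingface_hub.snapshot_download`) and read them locally with pyspark or dask.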
EDIT: I noticed that the load is pretty high at the moment; we might consider rate-limiting the /rows endpoint if this continues, to ensure that HF users have the best experience with the dataset viewer UI.