We are continuously querying the datasets server from around 260 instances at any given time, and we keep having connection dropouts or intermittently very slow downloads. Might there be an explanation for this? Is there anything we can do on our end to ensure stability?
Wouldn’t it be easy to create a function that catches HTTPError exceptions and retries? I don’t think there is a built-in retry function…
https://requests.readthedocs.io/en/latest/api/#requests.HTTPError
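Something like this would work as a starting point (just a rough sketch; the retry count, backoff, and the `/rows` query values are placeholders to adapt to your setup):

```python
import time
import requests

def get_with_retries(url, params=None, max_retries=5, backoff=1.0):
    """GET a URL, retrying on HTTP errors, connection errors and timeouts."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
            return response.json()
        except (requests.HTTPError, requests.ConnectionError, requests.Timeout):
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (2 ** attempt))  # exponential backoff

# example call against the /rows endpoint ("some_org/some_dataset" is a placeholder)
rows = get_with_retries(
    "https://datasets-server.huggingface.co/rows",
    params={
        "dataset": "some_org/some_dataset",
        "config": "default",
        "split": "train",
        "offset": 0,
        "length": 100,
    },
)
```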
We actually already have a retry mechanism set up. I was curious whether there are known dropouts on the datasets server that would result in applications using it not working.
Have you noticed any improvement? I am having similar issues with FineWeb datasets
No, this is an ongoing issue. It comes and goes: sometimes the connection is good and reliable, sometimes not.
Would really love to hear more opinions on this, since we are considering moving away from HF because of it.
cc @lhoestq
The dataset viewer is an API for viewing parts of the data, not for downloading all of it.
If you’re looking for a reliable download, you may be interested in downloading the data files directly, or in using tools like pyspark, dask, datasets, etc.
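For example, here is a minimal sketch with the `datasets` library, streaming records instead of paging through the `/rows` API (the repo id and split are placeholders to adjust to your dataset):

```python
from itertools import islice
from datasets import load_dataset

# Stream examples directly from the Hub instead of calling the viewer API.
# "HuggingFaceFW/fineweb" is just an example repo id.
ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for example in islice(ds, 100):
    print(example)
```

Alternatively, you can fetch the underlying data files themselves (e.g. with `huggingface_hub.snapshot_download`) and read them locally with pyspark or dask.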
EDIT: I noticed that the load is pretty high at the moment; we might consider rate-limiting the /rows endpoint if this continues, to ensure that HF users have the best experience with the dataset viewer UI.