Out of nowhere: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out

Hi, I'm streaming the laion2b dataset using:

self.dataset = load_dataset("laion/laion2b-en", streaming=True, split="train")

And I'm getting this error:

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out.

That's not the interesting part. What's interesting is that it worked for two weeks straight, and then out of nowhere the streaming stopped and now I can't run it at all (I get the error above).
My network manager says nothing changed in the configuration/proxy or anything else. Did something change on the “datasets” package side?

The full trace is:

  File "/workspace/dir/dir_env/lib/python3.8/site-packages/datasets/load.py", line 1502, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/workspace/dir/dir_env/lib/python3.8/site-packages/datasets/load.py", line 1219, in dataset_module_factory
    raise e1 from None
  File "/workspace/dir/dir_env/lib/python3.8/site-packages/datasets/load.py", line 1186, in dataset_module_factory
    raise e
  File "/workspace/dir/dir_env/lib/python3.8/site-packages/datasets/load.py", line 1160, in dataset_module_factory
    dataset_info = hf_api.dataset_info(
  File "/workspace/dir/dir_env/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/workspace/dir/dir_env/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 1666, in dataset_info
    r = get_session().get(path, headers=headers, timeout=timeout, params=params)
  File "/workspace/dir/dir_env/lib/python3.8/site-packages/requests/sessions.py", line 600, in get
    return self.request("GET", url, **kwargs)
  File "/workspace/dir/dir_env/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/workspace/dir/dir_env/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/workspace/dir/dir_env/lib/python3.8/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=100.0)
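
While this is unresolved, one workaround is to retry the call with a backoff. A minimal sketch; the retry count and backoff values are arbitrary assumptions, not anything from the datasets docs:

import time
import requests
from datasets import load_dataset

def load_dataset_with_retries(name, retries=5, backoff=30, **kwargs):
    # Retry only on read timeouts; any other error should surface immediately.
    for attempt in range(retries):
        try:
            return load_dataset(name, **kwargs)
        except requests.exceptions.ReadTimeout:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # linear backoff, arbitrary choice

dataset = load_dataset_with_retries("laion/laion2b-en", streaming=True, split="train")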

I am facing the same issue with cerebras/SlimPajama-627B.
Python 3.9
datasets 2.12.0
huggingface 0.0.1
huggingface-hub 0.13.4

Facing the same issue with tiiuae/falcon-7b-instruct, running huggingface 0.0.1 with similar boilerplate to the above.

The models that I already have cached locally seem to work fine, though.
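
That matches offline mode: if everything you need is already in the local cache, you can tell the libraries not to touch the Hub at all while the timeouts persist. A sketch, assuming the model is fully cached; HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and local_files_only are standard options, and the model name is just the example from this thread:

import os

# Both variables are read when the libraries are imported,
# so set them before the imports below.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import AutoModelForCausalLM

# local_files_only=True raises instead of hitting the network
# if any file is missing from the cache.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct", local_files_only=True
)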

The Hub had a minor outage. Can you please try again and report if it works now?

Thanks for your help; unfortunately, the problem still occurs.

transformers==4.30.2 works for me 🙂

We are seeing the same issue in some CI jobs that rely on Hugging Face models.

I have the same issue right now. (I also have problems when I want to upload a file/folder/model/dataset. At other times the model and dataset load into the cache and my Hub repo is cloned locally, but out of nowhere the job is killed and training does not start.)

Is there a way to monitor outages as they happen, e.g. a status page listing current incidents, so that we don't waste time looking for solutions on our end?

You can check the status here: https://status.huggingface.co/
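
For CI, it can also help to fail fast instead of hanging. A hypothetical pre-flight check; dataset_info is the same call that times out in the first traceback above, and the probe repo and 10-second timeout are arbitrary choices:

import requests
from huggingface_hub import HfApi

def hub_is_reachable(timeout=10.0):
    # Returns True if huggingface.co answers a metadata request in time.
    try:
        HfApi().dataset_info("laion/laion2b-en", timeout=timeout)
        return True
    except requests.exceptions.RequestException:
        return False

if not hub_is_reachable():
    raise SystemExit("Hub unreachable; check https://status.huggingface.co/")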

This is useful, thank you @mariosasko!

Has anyone figured out how to resolve this issue when it appears? According to the Hugging Face status page, everything is currently operational.

I started having the problem today with a SageMaker training job that worked last night. The only thing that has changed is the dataset I'm using for training, and that downloaded fine. It times out partway through downloading the safetensors files for meta-llama/Llama-2-7b-hf.

@mariosasko Any other insights, or awareness of an outage that is not showing up on the status page yet? Thanks!

Using the following versions:

accelerate-0.21.0
bitsandbytes-0.40.2 
huggingface-hub-0.17.3 
optimum-1.13.2 
peft-0.4.0 
safetensors-0.3.3 
sagemaker-training-4.7.0 
transformers-4.33.3

Error Log:

ErrorMessage "TimeoutError: The read operation timed out
 
 During handling of the above exception, another exception occurred
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.10/site-packages/requests/models.py", line 816, in generate
 yield from self.raw.stream(chunk_size, decode_content=True)
 File "/opt/conda/lib/python3.10/site-packages/urllib3/response.py", line 628, in stream
 data = self.read(amt=amt, decode_content=decode_content)
 File "/opt/conda/lib/python3.10/site-packages/urllib3/response.py", line 566, in read
 with self._error_catcher()
 File "/opt/conda/lib/python3.10/contextlib.py", line 153, in __exit__
 self.gen.throw(typ, value, traceback)
 File "/opt/conda/lib/python3.10/site-packages/urllib3/response.py", line 449, in _error_catcher
 raise ReadTimeoutError(self._pool, None, "Read timed out.")
 urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out.
 File "/opt/ml/code/run_clm.py", line 362, in <module>
 main()
 File "/opt/ml/code/run_clm.py", line 358, in main
 raise e
 File "/opt/ml/code/run_clm.py", line 347, in main
 training_function(args)
 File "/opt/ml/code/run_clm.py", line 229, in training_function
 model = AutoModelForCausalLM.from_pretrained(
 File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
 return model_class.from_pretrained(
 File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2869, in from_pretrained
 resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
 File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 1040, in get_checkpoint_shard_files
 cached_filename = cached_file(
 File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 429, in cached_file
 resolved_file = hf_hub_download(
 File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
 return fn(*args, **kwargs)
 File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1431, in hf_hub_download
 http_get(
 File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 551, in http_get
 for chunk in r.iter_content(chunk_size=10 * 1024 * 1024)
 File "/opt/conda/lib/python3.10/site-packages/requests/models.py", line 822, in generate
 raise ConnectionError(e)
 requests.exceptions.ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out.
 Downloading (…)of-00002.safetensors:  54%|█████▎    | 5.35G/9.98G [00:21<00:18, 252MB/s]"

One solution is to pass resume_download=True to from_pretrained (or to whichever call raised the error) and rerun the code. Suppose your download previously got to 10%: on the rerun it will not start from the beginning, but will instead resume from 10%, or wherever it left off.
For example:

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5, resume_download=True)
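
Since resume_download only helps once the call is retried, it can be paired with an automatic retry so long downloads survive transient CDN timeouts; the traceback above shows the timeout surfacing as a requests ConnectionError, which is what this catches. A sketch with assumed retry parameters:

import requests
from transformers import AutoModelForCausalLM

last_error = None
for attempt in range(5):  # 5 attempts is an arbitrary choice
    try:
        model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-hf", resume_download=True
        )
        break
    except (requests.exceptions.ConnectionError,
            requests.exceptions.ReadTimeout) as e:
        last_error = e  # each retry resumes the partial download
else:
    raise last_error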

Did anyone find a solution?