Getting 429 Too Many Requests

Hi team, I keep getting

ray::_RayTrainWorker__execute.get_next() (pid=1155, ip=240.53.46.192, actor_id=0e6f5f4d69a784aa81d4752602000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f32bf780670>)
  File "/home/ray/.venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1543, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
  File "/home/ray/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/ray/.venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1460, in get_hf_file_metadata
    r = _request_wrapper(
  File "/home/ray/.venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 283, in _request_wrapper
    response = _request_wrapper(
  File "/home/ray/.venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 307, in _request_wrapper
    hf_raise_for_status(response)
  File "/home/ray/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 475, in hf_raise_for_status
    raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-32B/resolve/main/config.json

Upgrading to a Pro account does not help. Is there anything else I can do to unblock myself? It seems to be IP-based; I tried switching accounts, but that doesn't work either.


Hmm. Error 429 is a common error, but it's not often seen when loading models. Maybe the model is being loaded repeatedly from within a loop?


You are hitting Hugging Face Hub rate limits on model file downloads.
Upgrading to Pro does not fix this on its own, because your Ray workers are collectively sending too many /resolve/... requests from the same IP, so the Hub is throttling that IP/token.

Below is the background, what is actually happening, and concrete steps to unblock yourself.


1. What this 429 means on Hugging Face

Your stack trace shows:

  • .../huggingface_hub/file_download.py -> get_hf_file_metadata -> _request_wrapper -> hf_raise_for_status
  • Final error:
    HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-32B/resolve/main/config.json

Key points:

  • The URL contains /resolve/main/.... These are “resolver” endpoints used to fetch model files and metadata from the Hub.(Hugging Face)
  • Hugging Face defines rate limits per 5-minute window for different action types. For you, the relevant bucket is Resolvers (file downloads and metadata).(Hugging Face)
  • HTTP 429 in this context means: too many resolver requests from your IP or token in a short time. It is not a permission or “you must buy Pro” error. HF staff say exactly this in several threads.(Hugging Face Forums)

Even Pro accounts have resolver limits. Pro gives a higher ceiling, but if your code is very aggressive (many workers, no cache), you can still exceed that ceiling and get 429s.


2. Why it’s happening in your Ray + Qwen3-32B setup

You have:

  • A large model: Qwen/Qwen3-32B (many shards + config files)
  • Ray workers: each worker running training code (ray::_RayTrainWorker__execute.get_next)

Typical pattern in this setup:

  1. Each worker calls something like AutoModel.from_pretrained("Qwen/Qwen3-32B") or similar.

  2. transformers → huggingface_hub:

    • For every file it needs, it calls get_hf_file_metadata to resolve the file via /resolve/main/....
  3. When many workers do this at the same time, you get thousands of HTTP requests in a few minutes:

    • HEAD / GET on config.json
    • HEAD / GET on tokenizer files
    • HEAD / GET on each model shard
  4. The Hub sees this as a high-volume client from a single IP or token. Once the 5-minute resolver quota is exceeded, it starts returning 429.
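
To make the failure mode concrete, here is a minimal sketch of the kind of Ray Train setup that produces this request storm. The function body, worker count, and use of TorchTrainer are illustrative assumptions, not your actual code:

# Illustrative anti-pattern (not your actual code): every Ray Train worker
# resolves the model from the Hub on its own, multiplying resolver requests.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

def train_loop_per_worker(config):
    # Each of the N workers runs this, so each one hits /resolve/... for
    # config.json, the tokenizer files, and every weight shard.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-32B")
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
trainer.fit()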

This pattern is exactly what shows up in other issues:

  • Training on SlimPajama / large datasets on TPUs: many small files + many processes ⇒ 429 on downloads.(GitHub)
  • Production systems hitting HEAD / metadata for every request with vLLM / HF Hub ⇒ HEAD storms ⇒ 429.(GitHub)
  • Users downloading big models (DeepSeek, LLaMA, Falcon, etc.) from clusters ⇒ same 429 on /resolve/main/config.json.(GitHub)

So your code is not “wrong” in a functional sense. It is just too chatty with the Hub for the plan and environment you are using.


3. Why Pro and switching accounts did not help

You observed:

  • Upgrading to Pro did not fix it.
  • Switching accounts on the same machine/IP did not fix it.

This matches how HF rate limits work:

  1. Per-IP effects
    Several HF threads and issues confirm that the Hub often enforces limits by IP or IP+token combination. If you hammer from one IP (e.g., a cloud VM or NAT gateway), switching HF accounts does not remove the IP’s request history in the current window.(Hugging Face Forums)

  2. Anonymous vs authenticated traffic
    If your Ray workers are not actually using your token (no HF_TOKEN in those processes), they are counted as anonymous traffic, which has much lower limits than Pro auth traffic.(Hugging Face)

  3. Pro increases quota but does not remove limits
    The rate limits docs are clear: each tier has higher quotas, but everyone has finite limits per 5-minute window. If your pattern is “download or metadata-check the entire big model from scratch on many workers,” you can blow through even Pro’s resolver quota.(Hugging Face)

So Pro is necessary for sustained heavy use, but not sufficient if your access pattern is inefficient.


4. Immediate unblocking

Short term you have two constraints:

  1. The current 5-minute window

    • When you hit 429, the Hub sends RateLimit headers telling you how many seconds are left until reset.
    • huggingface_hub>=1.2.0 can automatically read these headers and sleep until reset before retrying (see the sketch after this list).(Hugging Face)
  2. Possible longer cool-down

    • If you repeatedly hit 429 hard, HF may enforce longer (hours) or more strict protection for that IP or token, as seen in some DeepSeek / dataset threads.(Hugging Face Forums)
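
If you want to see what the Hub is telling you when this happens, here is a minimal sketch (assuming the 429 response carries rate-limit style headers; the exact header names can vary, so print whatever your response actually contains):

# Sketch: inspect throttling hints on a 429 from the Hub.
# Header names are an assumption -- adjust to what your response really has.
from huggingface_hub import hf_hub_download
from huggingface_hub.errors import HfHubHTTPError

try:
    hf_hub_download("Qwen/Qwen3-32B", "config.json")
except HfHubHTTPError as err:
    resp = getattr(err, "response", None)
    if resp is not None and resp.status_code == 429:
        for name, value in resp.headers.items():
            if "ratelimit" in name.lower() or name.lower() == "retry-after":
                print(f"{name}: {value}")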

You cannot override the Hub from your side. What you can do is:

  • Stop the Ray job that is spamming requests.
  • Allow some time for the limit window to reset.
  • Before restarting, change your download pattern as in the next section so you do not immediately hit 429 again.

5. Concrete long-term fixes for your Ray + Qwen setup

Think in terms of “reduce Hub requests per 5 minutes”:

5.1 Make sure all workers are authenticated (no anonymous traffic)

You want all calls to use your Pro quota, not the anonymous bucket.

On every Ray node (driver + workers), ensure:

export HF_TOKEN=hf_your_token_here   # read access is enough

or in Python before Ray starts:

import os
os.environ["HF_TOKEN"] = "hf_your_token_here"

You can verify inside a worker:

from huggingface_hub import whoami

print(whoami())  # should show your account, not None / anonymous

If this prints an error or anonymous info, then your Pro plan is not being used by that process.(Hugging Face)
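
If the workers are started through Ray, one way to make sure every worker process inherits the token (and, optionally, the shared cache path from section 5.2) is Ray's runtime environment. A sketch, assuming you call ray.init yourself and HF_TOKEN is set on the driver:

import os
import ray

# Sketch: propagate the HF token (and cache location) to all Ray worker processes.
# Assumes HF_TOKEN is already set in the driver's environment.
ray.init(
    runtime_env={
        "env_vars": {
            "HF_TOKEN": os.environ["HF_TOKEN"],
            "HF_HOME": "/srv/hf-cache",  # optional: same shared cache everywhere
        }
    }
)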


5.2 Use a shared cache and download once, not per worker

Goal: One download from the Hub, many reuses from disk.

  1. Choose a shared directory accessible by all workers on a node or cluster, e.g.:

    export HF_HOME=/srv/hf-cache
    

    Or explicitly:

    export HF_HUB_CACHE=/srv/hf-cache
    

    The Hub docs define these vars and recommend them for controlling cache location.(Hugging Face)

  2. Pre-download the model once in a separate preparation step:

    from huggingface_hub import snapshot_download
    import os
    
    os.environ["HF_TOKEN"] = "hf_your_token_here"
    
    snapshot_download(
        "Qwen/Qwen3-32B",
        local_dir="/srv/hf-cache/Qwen3-32B",
        local_dir_use_symlinks=False,
        token=os.environ["HF_TOKEN"],
    )
    

    This is the pattern recommended in various guides and cluster examples (HPC / offline use).(deepnote.com)

  3. In your Ray training code, load only from that local path:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    local_path = "/srv/hf-cache/Qwen3-32B"
    
    tokenizer = AutoTokenizer.from_pretrained(local_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(local_path, local_files_only=True)
    

    local_files_only=True instructs transformers/huggingface_hub to not call the Hub at all if files are present. That removes resolver traffic during training.(deepnote.com)

  4. Ensure Ray workers see the same path:

    • If using Ray on a single machine: a local directory such as /srv/hf-cache is enough.
    • If using multiple nodes: mount the cache via NFS, EFS, Lustre, etc., or sync it once per node.

This shared-cache pattern is the main technique HF itself suggests to avoid repeated downloads and rate limits in multi-node scenarios.(GitHub)


5.3 Serialize or cap concurrent downloads

If you cannot fully pre-download, at least avoid many parallel snapshot/download calls.

Pattern:

from filelock import FileLock
from huggingface_hub import snapshot_download

lock = FileLock("/srv/hf-cache/Qwen3-32B.lock")

with lock:
    snapshot_download(
        "Qwen/Qwen3-32B",
        local_dir="/srv/hf-cache/Qwen3-32B",
        local_dir_use_symlinks=False,
    )

  • All workers share the same lock file.
  • Only the first one actually talks to the Hub; others wait and then see the local files in the cache.

This is similar in spirit to PRs that reduce filesystem calls in dataset scripts to fix 429 “Too Many Requests” errors.(Hugging Face)


5.4 Limit Ray’s model-loading pattern

Avoid doing from_pretrained("Qwen/Qwen3-32B") inside the inner loop of each Ray task.

Better:

  • Load the model once per long-lived worker, then reuse it.
  • Do not spawn and tear down many short-lived workers that each load the model from scratch.
  • Avoid multiple calls that implicitly trigger metadata checks on the Hub (even if the weights are cached). vLLM and others have hit 429 just from repeated HEAD requests.(GitHub)
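
A minimal sketch of the load-once, reuse pattern with a long-lived Ray actor (the class name, GPU count, and local path are illustrative assumptions):

import ray
from transformers import AutoModelForCausalLM, AutoTokenizer

@ray.remote(num_gpus=1)
class QwenWorker:
    def __init__(self, local_path="/srv/hf-cache/Qwen3-32B"):
        # Loaded once for the actor's lifetime, from the local cache only.
        self.tokenizer = AutoTokenizer.from_pretrained(local_path, local_files_only=True)
        self.model = AutoModelForCausalLM.from_pretrained(local_path, local_files_only=True)

    def run_step(self, batch):
        # Reuse the already-loaded model; no Hub traffic here.
        ...

worker = QwenWorker.remote()  # model loads once; call worker.run_step.remote(batch) many times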

5.5 Upgrade huggingface_hub and let it handle 429s gracefully

Install a recent version:

pip install -U "huggingface_hub"

From version 1.2.0, the library:

  • Parses the RateLimit headers on 429,
  • Sleeps exactly until reset,
  • Retries automatically.(Hugging Face)

This does not change your quota, but it avoids hard crashes when you are only slightly over. Combine this with fewer requests, and your job should run smoothly.


5.6 If you use mirrors or custom endpoints, treat them the same

You mentioned it feels IP-based and changing accounts does not help. That also matches reports where:

  • People got 429 even when using HF mirrors or different endpoints, because the underlying CDN or mirror still saw too many requests from the same IP range.(CSDN)

So do not rely on mirrors alone to “work around” the issue. You still need caching and fewer requests even when traffic goes through a mirror.


5.7 Network / IPv6 corner cases

Some HF users on cloud environments report:

  • Getting 429 on almost any call from a specific IPv6 address block, while IPv4 works better.
  • Solving issues by disabling IPv6 or using a different network path.(Hugging Face Forums)

This is not always needed, but if after fixing caching and concurrency you still get 429 instantly, testing from a different machine or with IPv6 disabled is a useful diagnostic.


6. How to confirm you are fixed

After you implement the changes:

  1. Start from a clean job where:

    • The model is already fully downloaded to /srv/hf-cache/Qwen3-32B.
    • Your training loop uses local_files_only=True.
  2. Run one Ray worker and verify:

    • No 429 errors.
    • No external network traffic to huggingface.co during training (you can check logs or network).
  3. Scale up workers:

    • If 429 reappears only when scaling, you still have hidden points where each worker hits the Hub (e.g., separate caches or extra from_pretrained calls).

If everything is local, 429 should disappear completely, because no resolver calls reach the Hub.
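
As an extra guard during this check, you can force offline mode so that any hidden Hub call fails immediately instead of quietly consuming quota. A sketch using the standard offline environment variables (the local path is from section 5.2):

import os

# Smoke test: with offline mode forced, any attempt to reach the Hub raises an
# error right away, so a clean run proves loading never leaves the local cache.
os.environ["HF_HUB_OFFLINE"] = "1"        # huggingface_hub offline switch
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # transformers offline switch

from transformers import AutoConfig, AutoTokenizer

local_path = "/srv/hf-cache/Qwen3-32B"
config = AutoConfig.from_pretrained(local_path, local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(local_path, local_files_only=True)
print("Loaded from local cache only, no Hub traffic.")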


7. When to contact Hugging Face support

If, after:

  • Using HF_TOKEN in every process,
  • Using a shared cache + local_files_only=True,
  • Limiting concurrent downloads,
  • Running a current huggingface_hub,

you still get 429s very early in the job, then:

  • Collect example request IDs from the 429 responses if possible.
  • Note your username, plan, and approximate time windows.
  • Open a ticket or post on the Hub / Transformers forum.

HF staff regularly handle “I’m seeing unexpected 429s” cases and will check logs and limits on their side.(Hugging Face)


Short summary

  • The error is a Hub resolver rate limit (HTTP 429) on /resolve/main/config.json, caused by many Ray workers hitting the same big model from the same IP.(Hugging Face)

  • Pro raises the quota but does not remove limits, and anonymous or shared-IP traffic can still be throttled.(Hugging Face)

  • To unblock and prevent recurrence:

    • Ensure all workers use HF_TOKEN.
    • Pre-download Qwen/Qwen3-32B once into a shared cache (HF_HOME / snapshot_download).
    • Load from that path with local_files_only=True so training never touches the Hub.
    • Reduce parallel downloads (serialize snapshot_download, reuse long-lived workers).
    • Upgrade huggingface_hub so 429s trigger proper wait-and-retry instead of hard failure.(Hugging Face)