Hmm… Error 429 is a common error, but it's not often seen when loading models. Maybe it's being loaded repeatedly from within a loop…?
You are hitting Hugging Face Hub rate limits on model file downloads.
Upgrading to Pro does not fix this on its own, because your Ray workers are collectively sending too many /resolve/... requests from the same IP, so the Hub is throttling that IP/token.
Below is the background, what is actually happening, and concrete steps to unblock yourself.
1. What this 429 means on Hugging Face
Your stack trace shows:
- .../huggingface_hub/file_download.py -> get_hf_file_metadata -> _request_wrapper -> hf_raise_for_status
- Final error:
HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-32B/resolve/main/config.json
Key points:
- The URL contains /resolve/main/.... These are "resolver" endpoints used to fetch model files and metadata from the Hub.(Hugging Face)
- Hugging Face defines rate limits per 5-minute window for different action types. For you, the relevant bucket is Resolvers (file downloads and metadata).(Hugging Face)
- HTTP 429 in this context means: too many resolver requests from your IP or token in a short time. It is not a permission or "you must buy Pro" error. HF staff say exactly this in several threads.(Hugging Face Forums)
Even Pro accounts have resolver limits. Pro gives a higher ceiling, but if your code is very aggressive (many workers, no cache), you can still exceed that ceiling and get 429s.
2. Why it's happening in your Ray + Qwen3-32B setup
You have:
- A large model: Qwen/Qwen3-32B (many shards + config files)
- Ray workers: each worker running training code (ray::_RayTrainWorker__execute.get_next)
Typical pattern in this setup:
- Each worker calls something like AutoModel.from_pretrained("Qwen/Qwen3-32B") or similar.
- transformers → huggingface_hub:
  - For every file it needs, it calls get_hf_file_metadata to resolve the file via /resolve/main/....
- When many workers do this at the same time, you get thousands of HTTP requests in a few minutes (a rough estimate follows below):
  - HEAD / GET on config.json
  - HEAD / GET on tokenizer files
  - HEAD / GET on each model shard
- The Hub sees this as a high-volume client from a single IP or token. Once the 5-minute resolver quota is exceeded, it starts returning 429.
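To make that concrete, here is a rough, illustrative estimate (the exact shard count for Qwen3-32B and your worker count are assumptions): a 32B checkpoint is typically split into roughly 15-20 safetensors shards, plus config and tokenizer files, so call it ~20 files. Each file needs at least one metadata (HEAD) request and, on a cold cache, a download (GET). With 32 independent workers, that is on the order of 20 files × 2 requests × 32 workers ≈ 1,280 resolver requests, all landing in the same 5-minute window from one IP.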
This pattern is exactly what shows up in other issues:
- Training on SlimPajama / large datasets on TPUs: many small files + many processes → 429 on downloads.(GitHub)
- Production systems hitting HEAD / metadata for every request with vLLM / HF Hub → HEAD storms → 429.(GitHub)
- Users downloading big models (DeepSeek, LLaMA, Falcon, etc.) from clusters → the same 429 on /resolve/main/config.json.(GitHub)
So your code is not "wrong" in a functional sense. It is just too chatty with the Hub for the plan and environment you are using.
3. Why Pro and switching accounts did not help
You observed:
- Upgrading to Pro did not fix it.
- Switching accounts on the same machine/IP did not fix it.
This matches how HF rate limits work:
- Per-IP effects
  Several HF threads and issues confirm that the Hub often enforces limits by IP or IP+token combination. If you hammer from one IP (e.g., a cloud VM or NAT gateway), switching HF accounts does not remove the IP's request history in the current window.(Hugging Face Forums)
- Anonymous vs authenticated traffic
  If your Ray workers are not actually using your token (no HF_TOKEN in those processes), they are counted as anonymous traffic, which has much lower limits than Pro auth traffic.(Hugging Face)
- Pro increases quota but does not remove limits
  The rate limits docs are clear: each tier has higher quotas, but everyone has finite limits per 5-minute window. If your pattern is "download or metadata-check the entire big model from scratch on many workers," you can blow through even Pro's resolver quota.(Hugging Face)
So Pro is necessary for sustained heavy use, but not sufficient if your access pattern is inefficient.
4. Immediate unblocking
Short term you have two constraints:
- The current 5-minute window
  - When you hit 429, the Hub sends RateLimit headers telling you how many seconds are left until reset (see the diagnostic sketch at the end of this section).
  - huggingface_hub>=1.2.0 can automatically read these headers and sleep until reset before retrying.(Hugging Face)
- Possible longer cool-down
  - If you repeatedly hit 429 hard, HF may enforce longer (hours) or more strict protection for that IP or token, as seen in some DeepSeek / dataset threads.(Hugging Face Forums)
You cannot override the Hub from your side. What you can do is:
- Stop the Ray job that is spamming requests.
- Allow some time for the limit window to reset.
- Before restarting, change your download pattern as in the next section so you do not immediately hit 429 again.
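If you want to see the reset information for yourself before restarting anything, here is a minimal diagnostic sketch. It only assumes the public hf_hub_download and HfHubHTTPError APIs; the exact rate-limit header names are whatever the Hub actually returns, so the script just prints them all:

from huggingface_hub import hf_hub_download
from huggingface_hub.errors import HfHubHTTPError

try:
    # A single resolver call is enough to tell whether this IP/token is still throttled.
    hf_hub_download("Qwen/Qwen3-32B", "config.json")
    print("not throttled: config.json resolved fine")
except HfHubHTTPError as e:
    # On a 429, the response headers carry the rate-limit / reset information.
    print(e.response.status_code)
    print(dict(e.response.headers))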
5. Concrete long-term fixes for your Ray + Qwen setup
Think in terms of "reduce Hub requests per 5 minutes":
5.1 Make sure all workers are authenticated (no anonymous traffic)
You want all calls to use your Pro quota, not the anonymous bucket.
On every Ray node (driver + workers), ensure:
export HF_TOKEN=hf_your_token_here # read access is enough
or in Python before Ray starts:
import os
os.environ["HF_TOKEN"] = "hf_your_token_here"
You can verify inside a worker:
from huggingface_hub import whoami
print(whoami()) # should show your account, not None / anonymous
If this prints an error or anonymous info, then your Pro plan is not being used by that process.(Hugging Face)
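If exporting the variable on every node is awkward, Ray can also propagate it for you. A minimal sketch, assuming you call ray.init yourself and reuse the /srv/hf-cache path from 5.2 (both the token handling and the path are placeholders for your setup):

import os
import ray

# Push the token and cache location into every Ray worker's environment.
# Assumes HF_TOKEN is already set on the driver; /srv/hf-cache is a placeholder path.
ray.init(
    runtime_env={
        "env_vars": {
            "HF_TOKEN": os.environ["HF_TOKEN"],
            "HF_HOME": "/srv/hf-cache",
        }
    }
)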
5.2 Use a shared cache and download once, not per worker
Goal: One download from the Hub, many reuses from disk.
- Choose a shared directory accessible by all workers on a node or cluster, e.g.:
  export HF_HOME=/srv/hf-cache
  Or explicitly:
  export HF_HUB_CACHE=/srv/hf-cache
  The Hub docs define these vars and recommend them for controlling cache location.(Hugging Face)
- Pre-download the model once in a separate preparation step:
from huggingface_hub import snapshot_download
import os

os.environ["HF_TOKEN"] = "hf_your_token_here"
snapshot_download(
    "Qwen/Qwen3-32B",
    local_dir="/srv/hf-cache/Qwen3-32B",
    local_dir_use_symlinks=False,
    token=os.environ["HF_TOKEN"],
)
This is the pattern recommended in various guides and cluster examples (HPC / offline use).(deepnote.com)
- In your Ray training code, load only from that local path:
from transformers import AutoModelForCausalLM, AutoTokenizer
local_path = "/srv/hf-cache/Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(local_path, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(local_path, local_files_only=True)
local_files_only=True instructs transformers/huggingface_hub to not call the Hub at all if files are present. That removes resolver traffic during training.(deepnote.com)
- Ensure Ray workers see the same path:
  - If using Ray on a single machine: mount /srv/hf-cache locally.
  - If using multiple nodes: mount it via NFS, EFS, Lustre, etc., or sync the cache once per node.
This shared-cache pattern is the main technique HF itself suggests to avoid repeated downloads and rate limits in multi-node scenarios.(GitHub)
5.3 Serialize or cap concurrent downloads
If you cannot fully pre-download, at least avoid many parallel snapshot/download calls.
Pattern:
from filelock import FileLock
from huggingface_hub import snapshot_download

lock = FileLock("/srv/hf-cache/Qwen3-32B.lock")
with lock:
    snapshot_download(
        "Qwen/Qwen3-32B",
        local_dir="/srv/hf-cache/Qwen3-32B",
        local_dir_use_symlinks=False,
    )
- All workers share the same lock file.
- Only the first one actually talks to the Hub; others wait and then see the local files in the cache.
This is similar in spirit to PRs that reduce filesystem calls in dataset scripts to fix 429 "Too Many Requests" errors.(Hugging Face)
5.4 Limit Ray's model-loading pattern
Avoid doing from_pretrained("Qwen/Qwen3-32B") inside the inner loop of each Ray task.
Better:
- Load the model once per long-lived worker, then reuse it (see the sketch after this list).
- Do not spawn and tear down many short-lived workers that each load the model from scratch.
- Avoid multiple calls that implicitly trigger metadata checks on the Hub (even if the weights are cached). vLLM and others have hit 429 just from repeated HEAD requests.(GitHub)
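A minimal sketch of the long-lived pattern with a Ray actor, assuming the shared cache path from 5.2; the actor name, method, and num_gpus value are illustrative, not your actual training setup:

import ray
from transformers import AutoModelForCausalLM, AutoTokenizer

@ray.remote(num_gpus=1)
class ModelWorker:
    def __init__(self, local_path: str = "/srv/hf-cache/Qwen3-32B"):
        # from_pretrained runs exactly once per actor, from local files only.
        self.tokenizer = AutoTokenizer.from_pretrained(local_path, local_files_only=True)
        self.model = AutoModelForCausalLM.from_pretrained(local_path, local_files_only=True)

    def generate(self, prompt: str) -> str:
        # Every call reuses the already-loaded model; no cache or Hub access here.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output = self.model.generate(**inputs, max_new_tokens=32)
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

worker = ModelWorker.remote()                    # model loads once, here
print(ray.get(worker.generate.remote("Hello")))  # later calls reuse the in-memory model

The structural point is that the expensive from_pretrained call lives in __init__ and happens once per actor lifetime, instead of once per task.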
5.5 Upgrade huggingface_hub and let it handle 429s gracefully
Install a recent version:
pip install -U "huggingface_hub"
From version 1.2.0, the library:
- Parses the RateLimit headers on 429,
- Sleeps exactly until reset,
- Retries automatically.(Hugging Face)
This does not change your quota, but it avoids hard crashes when you are only slightly over. Combine this with fewer requests, and your job should run smoothly.
5.6 If you use mirrors or custom endpoints, treat them the same
You mentioned it feels IP-based and changing accounts does not help. That also matches reports where:
- People got 429 even when using HF mirrors or different endpoints, because the underlying CDN or mirror still saw too many requests from the same IP range.(CSDN)
So do not rely on mirrors alone to "work around" the issue; you still need caching and fewer requests.
5.7 Network / IPv6 corner cases
Some HF users on cloud environments report:
- Getting 429 on almost any call from a specific IPv6 address block, while IPv4 works better.
- Solving issues by disabling IPv6 or using a different network path.(Hugging Face Forums)
This is not always needed, but if after fixing caching and concurrency you still get 429 instantly, testing from a different machine or with IPv6 disabled is a useful diagnostic.
6. How to confirm you are fixed
After you implement the changes:
- Start from a clean job where:
  - The model is already fully downloaded to /srv/hf-cache/Qwen3-32B.
  - Your training loop uses local_files_only=True.
- Run one Ray worker and verify:
  - No 429 errors.
  - No external network traffic to huggingface.co during training (you can check logs or network; one way to verify is shown below).
- Scale up workers:
  - If 429 reappears only when scaling, you still have hidden points where each worker hits the Hub (e.g., separate caches or extra from_pretrained calls).
If everything is local, 429 should disappear completely, because no resolver calls reach the Hub.
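One simple way to verify the no-network claim (HF_HUB_OFFLINE is a standard huggingface_hub environment variable; the path is the assumed shared cache from 5.2): run a load with the Hub hard-disabled. If it succeeds, the job cannot be generating resolver traffic; if it fails, the traceback shows which code path still wants the Hub.

import os
os.environ["HF_HUB_OFFLINE"] = "1"  # must be set before transformers/huggingface_hub are imported

from transformers import AutoModelForCausalLM, AutoTokenizer

local_path = "/srv/hf-cache/Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(local_path, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(local_path, local_files_only=True)
print("Loaded entirely from the local cache; no resolver calls were possible.")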
7. When to contact Hugging Face support
If, after:
- Using HF_TOKEN in every process,
- Using a shared cache + local_files_only=True,
- Limiting concurrent downloads,
- Running a current huggingface_hub,
you still get 429s very early in the job, then:
- Collect example request IDs from the 429 responses if possible.
- Note your username, plan, and approximate time windows.
- Open a ticket or post on the Hub / Transformers forum.
HF staff regularly handle "I'm seeing unexpected 429s" cases and will check logs and limits on their side.(Hugging Face)
Short summary
- The error is a Hub resolver rate limit (HTTP 429) on /resolve/main/config.json, caused by many Ray workers hitting the same big model from the same IP.(Hugging Face)
- Pro raises the quota but does not remove limits, and anonymous or shared-IP traffic can still be throttled.(Hugging Face)
- To unblock and prevent recurrence:
  - Ensure all workers use HF_TOKEN.
  - Pre-download Qwen/Qwen3-32B once into a shared cache (HF_HOME / snapshot_download).
  - Load from that path with local_files_only=True so training never touches the Hub.
  - Reduce parallel downloads (serialize snapshot_download, reuse long-lived workers).
  - Upgrade huggingface_hub so 429s trigger proper wait-and-retry instead of hard failure.(Hugging Face)