I am using a fine-tuned Llama model for inference with vLLM and keep getting this error:
Traceback (most recent call last):
  File "/pfss/mlde/workspaces/mlde_wsp_Rohrbach/users/ns94feza/.conda/envs/llmonk2/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/pfss/mlde/workspaces/mlde_wsp_Rohrbach/users/ns94feza/.conda/envs/llmonk2/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/hbXNov/llama3.1-8b_train_gpt_4o_verifications_e3_lr5e-7-31389-merged/resolve/main/sentence_bert_config.json
I have been using this script for weeks and only started getting this error recently. For context, I was rate-limited earlier today while trying to upload a large dataset, and I wonder if that caused this issue and whether it's possible to lift the limit. Also, I have the model locally, so I can load it successfully like this:
from transformers import AutoModelForCausalLM

# model_name points at my local copy of the merged checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
but when I load the same model through vLLM, everything crashes with the error above.
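For reference, the vLLM side is roughly this (a simplified sketch of my script; the real one uses different prompts and sampling parameters, and model_name is the same Hub ID that appears in the 429 URL above):

from vllm import LLM, SamplingParams

# Same repo ID as in the traceback URL; vLLM resolves it against
# huggingface.co, which is where the 429 is returned.
model_name = "hbXNov/llama3.1-8b_train_gpt_4o_verifications_e3_lr5e-7-31389-merged"

llm = LLM(model=model_name, dtype="auto")
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))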
Any help would be much appreciated!
Tagging @Wauplin in case you can help.