I have a server with an A100 GPU, and I am trying to make it serve a Llama 3.2 model (I have been granted access to the gated repo) using my Hugging Face account, but I always get:
requests.exceptions.ConnectionError: (MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /meta-llama/Llama-3.2-3B-Instruct/resolve/main/config.json (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f3a991a2ab0>: Failed to resolve \'huggingface.co\' ([Errno -3] Temporary failure in name resolution)"))'), '(Request ID: 931b8e3f-1122-4dcd-9062-a3a9fa631783)')
My Docker invocation looks like this:
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=hf_XXXX" \
  --env "VLLM_API_KEY=XXX" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --max_model_len 10000 \
  --chat-template /root/.cache/huggingface/templates/tool_chat_template_llama3.2_pythonic.jinja
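The traceback above comes from the vLLM process inside the container, so it is the container's network namespace where the lookup fails. To reproduce just the DNS step without loading the model, a check along these lines should isolate it (assuming the image's entrypoint can be overridden this way):

docker run --rm --entrypoint python3 vllm/vllm-openai:latest -c "import socket; print(socket.gethostbyname('huggingface.co'))"

On a working setup this prints an IP address; given the traceback, I would expect it to fail with the same name-resolution error here.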
Since this is the first time I'm downloading the model, I can't use offline mode (e.g. HF_HUB_OFFLINE=1), because nothing is cached yet.
From the server itself (outside the container) I can certainly resolve the domain huggingface.co and connect to port 443.
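To be concrete, by "resolve and connect" I mean that checks along these lines succeed on the host (illustrative commands; any equivalent test behaves the same):

getent hosts huggingface.co
curl -sI https://huggingface.co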
Thank you very much for any suggestions,
Ruben