Dockerfile for deploying Qwen QwQ 32B on A10Gs , L4s or L40S

samagra14 · March 19, 2025, 1:29pm

Adding a Dockerfile here that can be used to deploy Qwen on any machine which has a combined GPU RAM of ~80GBs. The below Dockerfile is for multi-GPU L4 instances as L4s are the cheapest ones on AWS, feel free to make changes to try it on L40S, A10Gs, A100s etc. Soon will follow up with metrics around single request tokens / sec and throughput.

# Dockerfile for Qwen QwQ 32B

FROM vllm/vllm-openai:latest

# Enable HF Hub Transfer for faster downloads
ENV HF_HUB_ENABLE_HF_TRANSFER 1

# Expose port 80
EXPOSE 80

# Entrypoint with API key
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
            # name of the model
           "--model", "Qwen/QwQ-32B", \
           # set the data type to bfloat16 - requires ~1400GB GPU memory
           "--dtype", "bfloat16", \
           "--trust-remote-code", \
           # below runs the model on 4 GPUs
           "--tensor-parallel-size","4", \
           # Maximum number of tokens, can lead to OOM if overestimated
           "--max-model-len", "8192", \
           # Port on which to run the vLLM server
           "--port", "80", \
           # CPU offload in GB. Need this as 8 H100s are not sufficient
           "--cpu-offload-gb", "80", \
           "--gpu-memory-utilization", "0.95", \
           # API key for authentication to the server stored in Tensorfuse secrets
           "--api-key", "${VLLM_API_KEY}"]

You can use the following commands to build and run the above Dockerfile.

docker build -t qwen-qwq-32b .

followed by

docker run --gpus all --shm-size=2g -p 80:80 -e VLLM_API_KEY=YOUR_API_KEY qwen-qwq-32b

Originally posted here: -

Topic		Replies	Views
TGI - use both GPU and CPU Beginners	1	54	April 19, 2025
How to deploy larger model inference on multiple machine with multiple GPU？ 🤗Transformers	1	2547	December 19, 2023
How to use Qwen2-VL on multiple gpus? 🤗Transformers	2	1278	September 28, 2024
Too large to be loaded automatically (16GB > 10GB) issue with QWEN 2.5 VL 7B Inference Endpoints on the Hub	2	103	April 15, 2025
Model super slow and barely uses any CPU or memory Beginners	4	1549	July 4, 2024

Dockerfile for deploying Qwen QwQ 32B on A10Gs , L4s or L40S

Related topics