Hello, Hugging Face community!
I’ve been trying for several days to set up TensorRT to accelerate inference of the DeepSeek-R1-Distill-Qwen-32B model in a Hugging Face Space, but I keep running into a chain of dependency conflicts.
Current Configuration
- Base image: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
- Model: nicoboss/DeepSeek-R1-Distill-Qwen-32B-Uncensored
- Hardware: NVIDIA L4 GPUs (4x21GB)
Identified Issues
- FlashAttention Conflict:
  ERROR: Failed to import transformers.models.qwen2.modeling_qwen2 /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
  (As far as I can tell, the undefined symbol means flash-attn was compiled against a different PyTorch version than the one installed.)
- TensorRT Not Initializing:
  WARNING: TensorRT unavailable: cannot import name 'optimize_model' from 'torch_tensorrt.dynamo'
- Library Version Conflicts:
  ERROR: Cannot install tensorrt==8.6.1 and torch-tensorrt==1.3.0: torch-tensorrt 1.3.0 depends on tensorrt<8.6.0 and >=8.5.1.7
- NumPy Compatibility Issues:
  A module that was compiled using NumPy 1.x cannot be run in NumPy 2.2.4
- CUDA Not Detected During Build (see the runtime check sketch after this list):
  PyTorch 2.0.1+cu118, CUDA available: False, CUDA version: N/A
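On the last point, I suspect the build-time result is actually expected: as far as I know, no GPU is attached while docker build runs on Spaces, so torch.cuda.is_available() can only become True at container runtime. This is a minimal sketch of the startup check I run instead (the helper name report_cuda is my own):

import torch

def report_cuda() -> None:
    # Runs at container startup, where the GPU is actually attached.
    print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        for i in range(torch.cuda.device_count()):
            print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

report_cuda()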
What I’ve Tried
- Various PyTorch versions (2.1.2, 2.0.1)
- Different TensorRT and torch-tensorrt versions
- Various NumPy versions (<2.0, 1.24.3)
- Cross-checking the compatibility matrices of PyTorch, TensorRT, and torch-tensorrt to identify conflicting pins
- Enabling/disabling FlashAttention
The model itself runs without TensorRT, but my specific goal is to get TensorRT optimization working.
Questions
- Which exact versions of PyTorch, TensorRT, and torch-tensorrt are compatible with each other and with NVIDIA L4 GPUs in the HF environment?
- How do I correctly configure FlashAttention so that it works together with TensorRT?
- Are there any ready-to-use Dockerfile configurations for large models with TensorRT optimization in the Hugging Face environment?
I would greatly appreciate any help and advice!
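For reference, this is roughly how I try to invoke TensorRT from the app code, using the documented torch_tensorrt.compile entry point (a minimal, self-contained sketch: TinyBlock is a stand-in module of my own and the shapes are placeholders; the real target would be a traced submodule of the 32B model):

import torch
import torch_tensorrt

# Stand-in module so the sketch runs on its own; the real target is a traced
# submodule of the HF model, not the full 32B checkpoint.
class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4096, 4096)

    def forward(self, x):
        return torch.nn.functional.gelu(self.linear(x))

model = TinyBlock().half().cuda().eval()

# torch_tensorrt.compile is the documented entry point in the 1.x line;
# torch_tensorrt.dynamo has no optimize_model there, hence the warning above.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 128, 4096), dtype=torch.half)],
    enabled_precisions={torch.half},
)
print(type(trt_model))

My full Dockerfile is below.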
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
LABEL maintainer="DeepSeek Model Server"
LABEL description="DeepSeek-R1-Distill model server with TensorRT and an optimized configuration"
ENV HF_HOME=/tmp/huggingface_cache
RUN mkdir -p /tmp/huggingface_cache && chmod -R 777 /tmp/huggingface_cache
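# Build- and runtime-tuning environment variables (CUDA paths, allocator, NCCL, feature flags)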
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    DEBIAN_FRONTEND=noninteractive \
    CUDA_HOME=/usr/local/cuda \
    PATH=$PATH:/usr/local/cuda/bin \
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 \
    PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512" \
    TORCH_CUDNN_V8_API_ENABLED=1 \
    TORCH_ALLOW_TF32=1 \
    TORCH_ENABLE_CUDA_CONV_HALF_KERNELS=1 \
    NCCL_P2P_DISABLE=0 \
    NCCL_IB_DISABLE=0 \
    TRANSFORMERS_NO_TORCH_COMPILE=1 \
    PORT=7860 \
    MAX_JOBS=4 \
    USE_TENSORRT=1 \
    USE_FLASH_ATTENTION=0
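# System packages needed to compile native wheels such as flash-attn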
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    git \
    python3-dev \
    python3-pip \
    libopenblas-dev \
    libblas-dev \
    liblapack-dev \
    curl \
    wget \
    pkg-config \
    ninja-build \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN pip3 install --no-cache-dir --upgrade pip setuptools wheel
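# Pin NumPy below 2.0 first; the "compiled using NumPy 1.x" error above came from wheels built against NumPy 1.x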
RUN pip3 install --no-cache-dir "numpy<2.0.0"
RUN pip3 install --no-cache-dir \
    torch==2.1.2+cu118 \
    torchvision==0.16.2+cu118 \
    torchaudio==2.1.2+cu118 \
    --extra-index-url https://download.pytorch.org/whl/cu118
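# Sanity check of the cu118 wheel; CUDA is not visible at build time, so "CUDA available" is False here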
RUN python3 -c "import torch; print(f'PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}, CUDA version: {torch.version.cuda if torch.cuda.is_available() else \"N/A\"}')"
RUN pip3 install --no-cache-dir ninja packaging
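# TensorRT stack; the exact version pairing below is part of what I am asking about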
RUN pip3 install --no-cache-dir
tensorrt==8.5.3.1
torch-tensorrt==1.4.0
onnx==1.15.0
onnxruntime-gpu==1.16.3
RUN pip3 install --no-cache-dir triton==2.0.0
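# Try a FlashAttention build with CUDA kernels first, then retry without them; a failure does not stop the build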
RUN pip3 install --no-cache-dir "einops>=0.6.1" && \
    (FLASH_ATTENTION_SKIP_CUDA_BUILD=0 pip3 install --no-cache-dir flash-attn==2.3.0 --no-build-isolation || \
    (echo "FlashAttention CUDA build failed, retrying without the CUDA build" && \
    FLASH_ATTENTION_SKIP_CUDA_BUILD=1 pip3 install --no-cache-dir flash-attn==2.3.0 --no-build-isolation) || \
    echo "FlashAttention not installed, continuing without it")
RUN pip3 install --no-cache-dir "huggingface-hub>=0.24.0,<1.0"
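# Core model-serving stack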
RUN pip3 install --no-cache-dir \
    "transformers==4.47.1" \
    "tokenizers==0.21.0" \
    "accelerate==0.26.1" \
    "sentencepiece==0.1.99" \
    "pydantic==2.6.1" \
    "fastapi==0.109.0" \
    "uvicorn[standard]==0.27.1" \
    "psutil==5.9.6" \
    "cachetools==5.3.2" \
    "pynvml==11.5.0" \
    "jinja2==3.1.3" \
    "tqdm>=4.66.3" \
    "packaging==23.2" \
    "typing_extensions==4.10.0" \
    "bitsandbytes==0.42.0"
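# Retrieval and text-preprocessing extras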
RUN pip3 install --no-cache-dir \
    "sentence-transformers==2.5.0" \
    "scikit-learn==1.3.2" \
    "rank-bm25==0.2.2" \
    "faiss-cpu==1.7.4" \
    "nltk==3.8.1" \
    "langdetect==1.0.9" \
    "protobuf==3.20.3" \
    "sacremoses==0.0.53" \
    "jieba==0.42.1"
COPY . /app/
EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]