I've been trying for several days to set up TensorRT for accelerating inference of the DeepSeek-R1-Distill-Qwen-32B model in Hugging Face space, but I'm facing a series of dependency conflicts

Hello, Hugging Face community!

I’ve been trying for several days to set up TensorRT for accelerating inference of the DeepSeek-R1-Distill-Qwen-32B model in Hugging Face space, but I’m facing a series of dependency conflicts.

Current Configuration

  • Base image: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
  • Model: nicoboss/DeepSeek-R1-Distill-Qwen-32B-Uncensored
  • Hardware: NVIDIA L4 GPUs (4x21GB)

Identified Issues

  1. FlashAttention Conflict:
    ERROR: Failed to import transformers.models.qwen2.modeling_qwen2 /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
  2. TensorRT Not Initializing:
    WARNING: TensorRT unavailable: cannot import name 'optimize_model' from 'torch_tensorrt.dynamo'
  3. Library Version Conflicts:
    ERROR: Cannot install tensorrt==8.6.1 and torch-tensorrt==1.3.0: torch-tensorrt 1.3.0 depends on tensorrt<8.6.0 and >=8.5.1.7
  4. NumPy Compatibility Issues:
    A module that was compiled using NumPy 1.x cannot be run in NumPy 2.2.4
  5. CUDA Not Detected During Build:
    PyTorch 2.0.1+cu118, CUDA available: False, CUDA version: N/A
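For reference, this is the small diagnostic I run to see which wheels actually ended up installed when I hit the conflicts above (just a sketch; the file name and package list are my own choice):

# check_versions.py - print the installed versions of the packages involved in the conflicts above
from importlib.metadata import version, PackageNotFoundError

packages = ["torch", "torch-tensorrt", "tensorrt", "flash-attn", "numpy", "transformers", "onnxruntime-gpu"]

for name in packages:
    try:
        print(f"{name}: {version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")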

What I’ve Tried

  • Various PyTorch versions (2.1.2, 2.0.1)
  • Different TensorRT and torch-tensorrt versions
  • Various NumPy versions (<2.0, 1.24.3)
  • Checking version compatibility and identifying conflicts
  • Enabling/disabling FlashAttention

The model runs without TensorRT, but my specific goal is to set up TensorRT optimization.

Questions

  1. Which exact versions of PyTorch, TensorRT, and torch-tensorrt are compatible with each other and with NVIDIA L4 GPUs in the HF environment?
  2. How do I correctly configure FlashAttention to work with TensorRT?
  3. Are there any ready-to-use Dockerfile configurations for large models with TensorRT optimization in the Hugging Face environment?

I would greatly appreciate any help and advice!

My current Dockerfile:
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

LABEL maintainer="DeepSeek Model Server"
LABEL description="DeepSeek-R1-Distill Model Server with TensorRT and optimized configuration"

ENV HF_HOME=/tmp/huggingface_cache
RUN mkdir -p /tmp/huggingface_cache && chmod -R 777 /tmp/huggingface_cache
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    DEBIAN_FRONTEND=noninteractive \
    CUDA_HOME=/usr/local/cuda \
    PATH=$PATH:/usr/local/cuda/bin \
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 \
    PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512" \
    TORCH_CUDNN_V8_API_ENABLED=1 \
    TORCH_ALLOW_TF32=1 \
    TORCH_ENABLE_CUDA_CONV_HALF_KERNELS=1 \
    NCCL_P2P_DISABLE=0 \
    NCCL_IB_DISABLE=0 \
    TRANSFORMERS_NO_TORCH_COMPILE=1 \
    PORT=7860 \
    MAX_JOBS=4 \
    USE_TENSORRT=1 \
    USE_FLASH_ATTENTION=0

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    git \
    python3-dev \
    python3-pip \
    libopenblas-dev \
    libblas-dev \
    liblapack-dev \
    curl \
    wget \
    pkg-config \
    ninja-build \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN pip3 install --no-cache-dir --upgrade pip setuptools wheel

RUN pip3 install --no-cache-dir "numpy<2.0.0"

RUN pip3 install --no-cache-dir \
    torch==2.1.2+cu118 \
    torchvision==0.16.2+cu118 \
    torchaudio==2.1.2+cu118 \
    --extra-index-url https://download.pytorch.org/whl/cu118

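# Note: Hugging Face Spaces does not expose the GPU during the image build, so this check is
# expected to report "CUDA available: False" at build time (issue 5 above); it only verifies
# that the torch wheel installs and imports.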
RUN python3 -c "import torch; print(f'PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}, CUDA version: {torch.version.cuda if torch.cuda.is_available() else \"N/A\"}')"

RUN pip3 install --no-cache-dir ninja packaging

RUN pip3 install --no-cache-dir \
    tensorrt==8.5.3.1 \
    torch-tensorrt==1.4.0 \
    onnx==1.15.0 \
    onnxruntime-gpu==1.16.3

RUN pip3 install --no-cache-dir triton==2.0.0

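# The undefined-symbol error from flash_attn_2_cuda (issue 1) usually means the flash-attn binary
# was built against a different torch than the one installed, so flash-attn is rebuilt here
# (--no-build-isolation) after torch has been pinned above.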
RUN pip3 install --no-cache-dir "einops>=0.6.1" && \
    FLASH_ATTENTION_SKIP_CUDA_BUILD=0 pip3 install --no-cache-dir flash-attn==2.3.0 --no-build-isolation || \
    (echo "FlashAttention CUDA build failed, retrying without CUDA build" && \
    FLASH_ATTENTION_SKIP_CUDA_BUILD=1 pip3 install --no-cache-dir flash-attn==2.3.0 --no-build-isolation || \
    echo "FlashAttention not installed")

RUN pip3 install --no-cache-dir "huggingface-hub>=0.24.0,<1.0"

RUN pip3 install --no-cache-dir \
    "transformers==4.47.1" \
    "tokenizers==0.21.0" \
    "accelerate==0.26.1" \
    "sentencepiece==0.1.99" \
    "pydantic==2.6.1" \
    "fastapi==0.109.0" \
    "uvicorn[standard]==0.27.1" \
    "psutil==5.9.6" \
    "cachetools==5.3.2" \
    "pynvml==11.5.0" \
    "jinja2==3.1.3" \
    "tqdm>=4.66.3" \
    "packaging==23.2" \
    "typing_extensions==4.10.0" \
    "bitsandbytes==0.42.0"

RUN pip3 install --no-cache-dir \
    "sentence-transformers==2.5.0" \
    "scikit-learn==1.3.2" \
    "rank-bm25==0.2.2" \
    "faiss-cpu==1.7.4" \
    "nltk==3.8.1" \
    "langdetect==1.0.9" \
    "protobuf==3.20.3" \
    "sacremoses==0.0.53" \
    "jieba==0.42.1"

COPY . /app/

EXPOSE 7860

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]


Hmm… I wonder if this is possible…

Can you help with the launch?


That’s fine. I don’t have much knowledge about Docker, but I think it should be possible to loosen the library version requirements…

For now, about issue 5.

You’re correct about Docker build time in Spaces: it doesn’t provide access to GPU hardware. Thus, any GPU-related commands shouldn’t be executed during your Dockerfile’s build step. For instance, commands like nvidia-smi or torch.cuda.is_available() won’t see a GPU while the image is being built.
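So the build-time check in your Dockerfile will always report False; you’d move the GPU detection into the app itself. A rough sketch, assuming a FastAPI app.py since your CMD runs uvicorn app:app:

# app.py (sketch) - detect the GPU at container startup, not at image build time
import torch
from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
def report_gpu():
    # Runs when the Space container starts, where the GPU is actually visible.
    if torch.cuda.is_available():
        print(f"CUDA {torch.version.cuda}, {torch.cuda.device_count()} GPU(s): {torch.cuda.get_device_name(0)}")
    else:
        print("No GPU visible at runtime")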

It’s a different implementation, but it seems easier to get TensorRT acceleration for LLMs through TensorRT-LLM…
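Something roughly like this with the TensorRT-LLM high-level LLM API (a sketch from memory; the API and model support change between releases, so check the current docs):

# Sketch of the TensorRT-LLM LLM API; names and arguments may differ by release.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="nicoboss/DeepSeek-R1-Distill-Qwen-32B-Uncensored",
          tensor_parallel_size=4)  # assuming the 4x L4 setup and that the model fits
params = SamplingParams(max_tokens=256, temperature=0.6)
outputs = llm.generate(["Explain TensorRT in one paragraph."], params)
print(outputs[0].outputs[0].text)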

Or try a newer torch-tensorrt?
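For example (a sketch with the dynamo frontend of a recent torch-tensorrt; whether the full Qwen2 decoder graph converts cleanly is a separate question):

# Sketch: compiling a module with a recent torch-tensorrt (dynamo frontend).
import torch
import torch_tensorrt

# Tiny stand-in module; the real target would be the Hugging Face model.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU()).eval().cuda()
example = torch.randn(8, 128).cuda()

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",                      # dynamo frontend in torch-tensorrt 2.x
    inputs=[example],
    enabled_precisions={torch.half},  # allow FP16 kernels
)
print(trt_model(example).shape)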