I've been trying for several days to set up TensorRT for accelerating inference of the DeepSeek-R1-Distill-Qwen-32B model in Hugging Face space, but I'm facing a series of dependency conflicts

Hello, Hugging Face community!

I’ve been trying for several days to set up TensorRT for accelerating inference of the DeepSeek-R1-Distill-Qwen-32B model in Hugging Face space, but I’m facing a series of dependency conflicts.

Current Configuration

  • Base image: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
  • Model: nicoboss/DeepSeek-R1-Distill-Qwen-32B-Uncensored
  • Hardware: NVIDIA L4 GPUs (4x21GB)

Identified Issues

  1. FlashAttention Conflict:
    ERROR: Failed to import transformers.models.qwen2.modeling_qwen2 /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
  2. TensorRT Not Initializing:
    WARNING: TensorRT unavailable: cannot import name 'optimize_model' from 'torch_tensorrt.dynamo'
  3. Library Version Conflicts:
    ERROR: Cannot install tensorrt==8.6.1 and torch-tensorrt==1.3.0: torch-tensorrt 1.3.0 depends on tensorrt<8.6.0 and >=8.5.1.7
  4. NumPy Compatibility Issues:
    A module that was compiled using NumPy 1.x cannot be run in NumPy 2.2.4
  5. CUDA Not Detected During Build:
    PyTorch 2.0.1+cu118, CUDA available: False, CUDA version: N/A
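For reference, this is the small diagnostic I run to see which wheels actually ended up installed when I hit the conflicts above (just a sketch; the file name and package list are my own choice):

# check_versions.py - print the installed versions of the packages involved in the conflicts above
from importlib.metadata import version, PackageNotFoundError

packages = ["torch", "torch-tensorrt", "tensorrt", "flash-attn", "numpy", "transformers", "onnxruntime-gpu"]

for name in packages:
    try:
        print(f"{name}: {version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")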

What I’ve Tried

  • Various PyTorch versions (2.1.2, 2.0.1)
  • Different TensorRT and torch-tensorrt versions
  • Various NumPy versions (<2.0, 1.24.3)
  • Checking version compatibility and identifying conflicts
  • Enabling/disabling FlashAttention

The model runs without TensorRT, but my specific goal is to set up TensorRT optimization.

Questions

  1. Which exact versions of PyTorch, TensorRT, and torch-tensorrt are compatible with each other and with NVIDIA L4 GPUs in the HF environment?
  2. How do I correctly configure FlashAttention to work with TensorRT?
  3. Are there any ready-to-use Dockerfile configurations for large models with TensorRT optimization in the Hugging Face environment?

I would greatly appreciate any help and advice!

My current Dockerfile:
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

LABEL maintainer="DeepSeek Model Server"
LABEL description="DeepSeek-R1-Distill Model Server with TensorRT and optimized configuration"

ENV HF_HOME=/tmp/huggingface_cache
RUN mkdir -p /tmp/huggingface_cache && chmod -R 777 /tmp/huggingface_cache
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    DEBIAN_FRONTEND=noninteractive \
    CUDA_HOME=/usr/local/cuda \
    PATH=$PATH:/usr/local/cuda/bin \
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 \
    PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512" \
    TORCH_CUDNN_V8_API_ENABLED=1 \
    TORCH_ALLOW_TF32=1 \
    TORCH_ENABLE_CUDA_CONV_HALF_KERNELS=1 \
    NCCL_P2P_DISABLE=0 \
    NCCL_IB_DISABLE=0 \
    TRANSFORMERS_NO_TORCH_COMPILE=1 \
    PORT=7860 \
    MAX_JOBS=4 \
    USE_TENSORRT=1 \
    USE_FLASH_ATTENTION=0

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    git \
    python3-dev \
    python3-pip \
    libopenblas-dev \
    libblas-dev \
    liblapack-dev \
    curl \
    wget \
    pkg-config \
    ninja-build \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN pip3 install --no-cache-dir --upgrade pip setuptools wheel

RUN pip3 install --no-cache-dir "numpy<2.0.0"

RUN pip3 install --no-cache-dir \
    torch==2.1.2+cu118 \
    torchvision==0.16.2+cu118 \
    torchaudio==2.1.2+cu118 \
    --extra-index-url https://download.pytorch.org/whl/cu118

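# Note: Hugging Face Spaces does not expose the GPU during the image build, so this check is
# expected to report "CUDA available: False" at build time (issue 5 above); it only verifies
# that the torch wheel installs and imports.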
RUN python3 -c "import torch; print(f'PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}, CUDA version: {torch.version.cuda if torch.cuda.is_available() else \"N/A\"}')"

RUN pip3 install --no-cache-dir ninja packaging

RUN pip3 install --no-cache-dir \
    tensorrt==8.5.3.1 \
    torch-tensorrt==1.4.0 \
    onnx==1.15.0 \
    onnxruntime-gpu==1.16.3

RUN pip3 install --no-cache-dir triton==2.0.0

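# The undefined-symbol error from flash_attn_2_cuda (issue 1) usually means the flash-attn binary
# was built against a different torch than the one installed, so flash-attn is rebuilt here
# (--no-build-isolation) after torch has been pinned above.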
RUN pip3 install --no-cache-dir "einops>=0.6.1" && \
    FLASH_ATTENTION_SKIP_CUDA_BUILD=0 pip3 install --no-cache-dir flash-attn==2.3.0 --no-build-isolation || \
    (echo "FlashAttention CUDA build failed, retrying without CUDA build" && \
    FLASH_ATTENTION_SKIP_CUDA_BUILD=1 pip3 install --no-cache-dir flash-attn==2.3.0 --no-build-isolation || \
    echo "FlashAttention not installed")

RUN pip3 install --no-cache-dir "huggingface-hub>=0.24.0,<1.0"

RUN pip3 install --no-cache-dir \
    "transformers==4.47.1" \
    "tokenizers==0.21.0" \
    "accelerate==0.26.1" \
    "sentencepiece==0.1.99" \
    "pydantic==2.6.1" \
    "fastapi==0.109.0" \
    "uvicorn[standard]==0.27.1" \
    "psutil==5.9.6" \
    "cachetools==5.3.2" \
    "pynvml==11.5.0" \
    "jinja2==3.1.3" \
    "tqdm>=4.66.3" \
    "packaging==23.2" \
    "typing_extensions==4.10.0" \
    "bitsandbytes==0.42.0"

RUN pip3 install --no-cache-dir \
    "sentence-transformers==2.5.0" \
    "scikit-learn==1.3.2" \
    "rank-bm25==0.2.2" \
    "faiss-cpu==1.7.4" \
    "nltk==3.8.1" \
    "langdetect==1.0.9" \
    "protobuf==3.20.3" \
    "sacremoses==0.0.53" \
    "jieba==0.42.1"

COPY . /app/

EXPOSE 7860

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]


Hmm… I wonder if this is possible…

Can you help with the launch?


That’s fine. I don’t have much knowledge about Docker, but I think it should be possible to loosen the library version requirements…

For now, about issue 5.

You’re correct about Docker build time in Spaces: it doesn’t provide access to GPU hardware. Thus, any GPU-related commands shouldn’t be executed during your Dockerfile’s build step. For instance, commands like nvidia-smi or torch.cuda.is_available() won’t see a GPU while the image is being built.
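So the build-time check in your Dockerfile will always report False; you’d move the GPU detection into the app itself. A rough sketch, assuming a FastAPI app.py since your CMD runs uvicorn app:app:

# app.py (sketch) - detect the GPU at container startup, not at image build time
import torch
from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
def report_gpu():
    # Runs when the Space container starts, where the GPU is actually visible.
    if torch.cuda.is_available():
        print(f"CUDA {torch.version.cuda}, {torch.cuda.device_count()} GPU(s): {torch.cuda.get_device_name(0)}")
    else:
        print("No GPU visible at runtime")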

It’s a different implementation, but it seems easier to get TensorRT acceleration for LLMs through TensorRT-LLM…
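Something roughly like this with the TensorRT-LLM high-level LLM API (a sketch from memory; the API and model support change between releases, so check the current docs):

# Sketch of the TensorRT-LLM LLM API; names and arguments may differ by release.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="nicoboss/DeepSeek-R1-Distill-Qwen-32B-Uncensored",
          tensor_parallel_size=4)  # assuming the 4x L4 setup and that the model fits
params = SamplingParams(max_tokens=256, temperature=0.6)
outputs = llm.generate(["Explain TensorRT in one paragraph."], params)
print(outputs[0].outputs[0].text)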

Or try a newer torch-tensorrt?
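For example (a sketch with the dynamo frontend of a recent torch-tensorrt; whether the full Qwen2 decoder graph converts cleanly is a separate question):

# Sketch: compiling a module with a recent torch-tensorrt (dynamo frontend).
import torch
import torch_tensorrt

# Tiny stand-in module; the real target would be the Hugging Face model.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU()).eval().cuda()
example = torch.randn(8, 128).cuda()

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",                      # dynamo frontend in torch-tensorrt 2.x
    inputs=[example],
    enabled_precisions={torch.half},  # allow FP16 kernels
)
print(trt_model(example).shape)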