PTX JIT broken on RTX 5080 (Blackwell / sm_120) – missing libnvptxcompiler.so in CUDA 12.8 / 12.9

Hey folks :waving_hand:,

I’m experimenting with LLM inference (vLLM, FlashAttention, etc.) on a custom-built workstation featuring an RTX 5080 (Blackwell architecture). I compiled PyTorch 2.9.0 myself with support for sm_120 (Blackwell) using CUDA 12.8.
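For reference, here's the kind of quick sanity check that confirms the custom build actually targets sm_120 (a minimal sketch; the printed values are illustrative):

```python
# Sanity check: does the custom PyTorch build know about the RTX 5080 / sm_120?
import torch

print(torch.__version__)                    # custom 2.9.0 build
print(torch.version.cuda)                   # 12.8
print(torch.cuda.get_device_name(0))        # NVIDIA GeForce RTX 5080
print(torch.cuda.get_device_capability(0))  # (12, 0) on Blackwell
print(torch.cuda.get_arch_list())           # should include 'sm_120' for AOT kernels
```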

Everything works great with AOT compilation, but JIT compilation fails using torch.utils.cpp_extension.load() or similar APIs — e.g. when building FlashAttention, custom CUDA kernels, or low-level ops for vLLM.
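Here's a minimal sketch of the JIT path that breaks. The kernel is a throwaway placeholder; any `cpp_extension` JIT build hits the same path:

```python
# Minimal repro sketch: build a trivial CUDA kernel at runtime via cpp_extension.
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
__global__ void add_one_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

torch::Tensor add_one(torch::Tensor x) {
    auto y = x.contiguous();
    int n = y.numel();
    add_one_kernel<<<(n + 255) / 256, 256>>>(y.data_ptr<float>(), n);
    return y;
}
"""

cpp_src = "torch::Tensor add_one(torch::Tensor x);"

# On the RTX 5080 setup described above, this runtime build is where things fall over.
mod = load_inline(
    name="add_one_ext",        # placeholder extension name
    cpp_sources=cpp_src,
    cuda_sources=cuda_src,
    functions=["add_one"],
    verbose=True,
)

print(mod.add_one(torch.zeros(8, device="cuda")))
```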

The error: at runtime, any attempt to JIT-compile a kernel fails because libnvptxcompiler.so can't be found (full traceback in the linked report below).

After deep debugging, I found the culprit: libnvptxcompiler.so is completely missing from the CUDA 12.8 and 12.9 .run and .deb installers, as well as from the official Docker images (!). This breaks JIT support entirely for Blackwell cards.
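If you want to check whether your own install is affected, something like this (a hypothetical check; paths assume a default /usr/local/cuda layout) searches the toolkit and the loader path for the library:

```python
# Hypothetical check: look for libnvptxcompiler* under the CUDA toolkit,
# then try to dlopen it from the default loader path.
import ctypes
import glob
import os

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
hits = glob.glob(os.path.join(cuda_home, "**", "libnvptxcompiler*"), recursive=True)
print("Files found under", cuda_home, ":", hits or "none")

try:
    ctypes.CDLL("libnvptxcompiler.so")
    print("libnvptxcompiler.so loaded OK")
except OSError as e:
    print("Failed to load libnvptxcompiler.so:", e)
```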

Full report with reproduction steps and technical analysis:
:link: https://forums.developer.nvidia.com/t/missing-libnvptxcompiler-so-in-cuda-12-8-12-9-blocking-ptx-jit-on-blackwell-gpus-sm-120-rtx-5080/338033


Let me know if you’ve encountered this or found a workaround. Right now there is no official CUDA support for JIT kernel compilation on Blackwell, which breaks a lot of modern tooling (vLLM, FlashAttention, cpp_extension, etc.).

Thanks!


:brain: Technical FAQ – Compilation vs Runtime Execution

Q: How were you able to compile PyTorch, FlashAttention, or vLLM without libnvptxcompiler.so?
A: That shared object is only required for runtime PTX JIT compilation.
It’s not needed for AOT (ahead-of-time) builds, as long as the kernels are compiled ahead of time for the right targets (sm_120 / compute_120); see the sketch below.
→ That’s why the builds themselves succeed, but any dynamic JIT step (e.g. cpp_extension.load()) fails at runtime.
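For completeness, here's what "compiled ahead of time for the right targets" looks like in a hypothetical setup.py for a CUDA extension (the extension name and source files are placeholders):

```python
# Hypothetical setup.py excerpt: SASS for sm_120 is baked in at build time,
# so no PTX JIT is needed on the RTX 5080 at runtime.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="my_blackwell_ops",  # placeholder name
    ext_modules=[
        CUDAExtension(
            name="my_blackwell_ops",
            sources=["ops.cpp", "ops_kernels.cu"],  # placeholder sources
            extra_compile_args={
                "cxx": ["-O3"],
                # AOT-compile real SASS for Blackwell, and embed PTX for
                # forward compatibility via compute_120.
                "nvcc": [
                    "-gencode=arch=compute_120,code=sm_120",
                    "-gencode=arch=compute_120,code=compute_120",
                ],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
# Equivalently, set TORCH_CUDA_ARCH_LIST="12.0" and let PyTorch pick the flags.
```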

