CUDA Error 802 on every H200 multi-GPU HF Job, across three vLLM images

Every H200 multi-GPU job I launch fails at CUDA initialization, before any model weights load. The error is:

```
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
```

The failure occurs in vLLM’s `multiproc_executor.py` at `WorkerProc` init. I’ve now tested three different vLLM image versions (CUDA 12.x runtime and CUDA 13 runtime) and the error is identical in all three. It is not model-specific, TP-size-specific, or CUDA-runtime-version-specific.

What I’ve confirmed:

| Setup | Result |
|---|---|
| `pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel` on h200x4, single process (`nvidia-smi` + `torch.cuda.device_count()`) | works, returns 4 |
| `vllm/vllm-openai:v0.19.1` on l4x4 | works end-to-end |
| `vllm/vllm-openai:v0.19.1` on h200x4, Qwen2.5-7B | fails with 802 (twice on retry) |
| `vllm/vllm-openai:v0.19.1` on h200x8, GLM-4.5-Base | fails with 802 |
| `vllm/vllm-openai:cu130-nightly` on h200x4, Qwen2.5-7B | fails with 802 |

The fact that plain PyTorch single-process works on the same h200x4 node but every vLLM multi-process worker fails suggests the issue is specific to how CUDA context is initialized inside spawned worker subprocesses on H200 nodes. This pattern matches Fabric Manager / NVSwitch visibility regressions documented in:

- How do I fix a "system not initialized" error on multi-GPU Droplets? | DigitalOcean Documentation

- RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized · Issue #2554 · awslabs/amazon-eks-ami · GitHub

- CUDA initialization failure with error Error 802: system not yet initialized - GPU - Hardware - NVIDIA Developer Forums

HF Jobs users can’t restart Fabric Manager or check FM/driver version match.
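The single-process versus worker-subprocess contrast described above can be checked without vLLM. The sketch below is a hypothetical probe, not vLLM's actual worker code: it starts a fresh Python interpreter (roughly what a spawned worker sees) and allocates a tensor on the GPU, so that a CUDA context is really created rather than only a device count being read. The function name `probe_cuda_in_fresh_process` and the returned dictionary layout are assumptions made for illustration.

```python
import subprocess
import sys

# Child-side probe: force real CUDA initialization, then report the device count.
PROBE = "import torch; torch.zeros(1, device='cuda'); print(torch.cuda.device_count())"


def probe_cuda_in_fresh_process(timeout_s: float = 120.0) -> dict:
    """Run the CUDA probe in a brand-new interpreter and summarize the outcome.

    A non-zero return code means the child could not initialize CUDA
    (or could not import torch at all); the last stderr line usually
    carries the error, e.g. the "Error 802" message.
    """
    proc = subprocess.run(
        [sys.executable, "-c", PROBE],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    stderr_lines = proc.stderr.strip().splitlines()
    return {
        "returncode": proc.returncode,
        "stdout": proc.stdout.strip(),
        "stderr_last_line": stderr_lines[-1] if stderr_lines else "",
    }


if __name__ == "__main__":
    print(probe_cuda_in_fresh_process())
```

If this probe fails with 802 on h200x4 while succeeding on l4x4, that would point at the node rather than at vLLM; if it succeeds, the difference is more likely in how vLLM sets up its workers.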

**Details:**

- Flavors: h200x8 and h200x4 (both fail)

- Host driver (confirmed via `nvidia-smi` inside h200x4 container): NVIDIA 580.126.09, CUDA 13.0, 4× H200 @ 143771 MiB

- Job IDs:
  - `elenaajayi/69e5aa28ac288e522d8f0179` (h200x8, GLM-4.5-Base, v0.19.1)
  - `elenaajayi/69e5ab1dac288e522d8f017d` (h200x4, Qwen2.5-7B, v0.19.1)
  - `elenaajayi/69e5ac7eac288e522d8f0181` (h200x4, Qwen2.5-7B, v0.19.1, retry)
  - `elenaajayi/69e61257ac288e522d8f0281` (h200x4, Qwen2.5-7B, cu130-nightly)

- Controls:
  - `elenaajayi/69e5a714ac288e522d8f0177` (l4x4, same image, runs clean)
  - `elenaajayi/69e5be88cd8c002f31dffddc` (h200x4, plain PyTorch, nvidia-smi + device_count() succeed)

- Docker images tested: `vllm/vllm-openai:v0.19.1`, `vllm/vllm-openai:cu130-nightly`, `pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel`

- `huggingface_hub`: 0.26.2

Is the HF infrastructure team aware of this? Is there a timeline for a fix, or an alternative H200 flavor I can try? This is blocking a NeurIPS paper run.

Seems like a platform-side issue? An LLM suggested:


This looks less like a pure vLLM bug and more like an H200 multi-GPU / NVSwitch / Fabric Manager issue on the HF side.

If I were debugging it, I’d probably try three things first: a single-GPU test, a look at the Fabric state, and a run with the spawn start method.

If single-GPU works but multi-GPU fails, or Fabric state looks wrong, this probably isn’t something a user can really fix from inside the job. The NVIDIA AI Enterprise docs say Fabric Manager is required for HGX 1/2/4/8-GPU VMs, and on H100/H200 shared NVSwitch setups that management lives on the host / service-VM side. That makes it sound much more like an HF infra issue than an application issue.

So I’d probably keep both the forum thread and the GitHub issue updated with:

  • job ID
  • image + flavor
  • what works vs what fails
  • result of the single-GPU test
  • Fabric output
  • whether spawn changes anything

There’s also a somewhat similar AWS EKS issue where vLLM hit the same CUDA 802 path and it ended up looking node / AMI-side rather than model-side.

CUDA 802 on H200 multi-GPU Jobs, looks like NVSwitch Fabric Manager isn’t ready at job start

Posting here too in case anyone on Jobs / infra sees this first. Full repro with 4 failing job IDs, 2 working controls (l4x4 and plain-PyTorch on h200x4), images, flavors, and references is in the GitHub issue "CUDA Error 802 on all H200 multi-GPU HF Jobs with vLLM, across CUDA 12 and CUDA 13 images" (Issue #4128, huggingface/huggingface_hub).

Short version of what’s new since I filed that issue:

Fabric Manager state at job start on h200x4 (cu130-nightly image, driver 580.126.09 / CUDA 13.0):

Fabric
State : In Progress
Status : N/A
GPU Fabric GUID : N/A

Identical on all 4 H200s, never transitioned to Completed during the job.
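For anyone wanting to capture the same Fabric state in their own job logs, here is a minimal sketch that extracts the per-GPU `State` value from `nvidia-smi -q` text. It assumes the layout shown above (a `Fabric` header followed by a `State : ...` line); the function name `fabric_states` is hypothetical, and the sample text is just the output quoted in this post.

```python
from typing import List


def fabric_states(nvidia_smi_q_output: str) -> List[str]:
    """Return the Fabric 'State' value for each GPU section in the text.

    Only the first non-empty line after a 'Fabric' header is inspected, so
    unrelated keys such as 'Performance State' are not picked up.
    """
    states: List[str] = []
    after_fabric_header = False
    for line in nvidia_smi_q_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == "Fabric":
            after_fabric_header = True
            continue
        if after_fabric_header:
            key, sep, value = stripped.partition(":")
            if sep and key.strip() == "State":
                states.append(value.strip())
            after_fabric_header = False
    return states


SAMPLE = """\
Fabric
State : In Progress
Status : N/A
GPU Fabric GUID : N/A
"""

if __name__ == "__main__":
    print(fabric_states(SAMPLE))
```

Inside a job, the input would be the captured stdout of `nvidia-smi -q`, logged once at start and once just before the framework touches CUDA.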

vLLM at TP=1 on h200x4 also fails with CUDA 802 (CUDA_VISIBLE_DEVICES=0, tensor_parallel_size=1, Qwen/Qwen2.5-0.5B). So no tensor-parallel routing and no NVLink handoff in play – vLLM just touching CUDA on a single visible device. Same error as the multi-GPU case.

Combined with the plain-PyTorch control (which works fine on the same h200x4 flavor), it really does look like vLLM’s CUDA init path runs before Fabric Manager is ready on NVSwitch hosts, while PyTorch’s init tolerates it.

Asks:

  1. Can the Jobs entrypoint on NVSwitch flavors wait for Fabric State: Completed before starting user code? Even a 30-60s gate would prevent this.

  2. Is there a currently-working H200-class flavor (or equivalent multi-GPU flavor) with enough VRAM for a base model in the 70-110B range? I was targeting GLM-4.5-Base (355B, fits on h200x4 / x8 only). If that’s not unblockable this week, are any other large base models on your infra currently running successfully – e.g. DeepSeek V3 Base, LLaMA 4 Scout Base, Qwen 2.5 72B Base – or are they all hitting the same CUDA 802 path? Any confirmed-working pairing of (flavor, base model) would help most.
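The gate requested in the first ask could look roughly like the sketch below. It is an illustration under assumptions, not how the Jobs entrypoint actually works: it polls `nvidia-smi -q`, assumes the `Fabric` / `State : ...` layout quoted earlier, and treats only `Completed` as ready. The names `read_fabric_states` and `wait_for_fabric` are hypothetical. On hosts without NVSwitch, where the state may never read `Completed`, this simply times out and returns `False`.

```python
import re
import subprocess
import time
from typing import Callable, List


def read_fabric_states() -> List[str]:
    """Read the Fabric 'State' of every GPU from `nvidia-smi -q`.

    Raises FileNotFoundError if nvidia-smi is not on PATH.
    """
    out = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    ).stdout
    # Assumed layout: a 'Fabric' header line directly followed by 'State : <value>'.
    return re.findall(r"^\s*Fabric\s*\n\s*State\s*:\s*(.+?)\s*$", out, flags=re.MULTILINE)


def wait_for_fabric(
    get_states: Callable[[], List[str]] = read_fabric_states,
    timeout_s: float = 60.0,
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every GPU reports Fabric state 'Completed' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        states = get_states()
        if states and all(state == "Completed" for state in states):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    ready = wait_for_fabric()
    print("fabric ready" if ready else "fabric not ready, giving up")
```

A user could also run this as the first step of their own job command, although if the state never leaves `In Progress` (as observed here) it only turns a crash into a clean early exit.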

Time-sensitive on my end, so any pointer on a flavor that works today would help most.


In that case, contact HF first anyway. Email is the most reliable way: website@huggingface.co (there may be another address dedicated to Inference Endpoints, but this general address is usually fine).