Is this due to Numba-CUDA? Yes.
The first training step enters NeMo’s Numba-JIT RNNT loss (rnnt_loss_gpu) and the process crashes. Not OOM, not hardware: it is a JIT/linker mismatch between Numba-CUDA, your CUDA toolkit, and the NVIDIA driver. The model still exercises the RNNT path even when you try to down-weight it.
Causes
- Numba-CUDA JIT/linker incompatibility. RNNT in NeMo uses Numba-compiled CUDA kernels, and this path is sensitive to the exact Numba, CUDA toolkit, and driver combination; the first kernel launch often fails hard when linkage is off. NVIDIA documents the intended FP16 RNNT stack and flags. Your backtraces point into .../parts/numba/rnnt_loss/rnnt.py. (NVIDIA)
- CUDA 12 without Minor Version Compatibility (MVC). With CUDA 12 wheels, JIT linking relies on NVJitLink. If pynvjitlink is missing or MVC is disabled, Numba can segfault on the first launch even if plain PyTorch works. (Numba Official Document)
- Required env flags not set before import. NeMo’s RNNT+Numba path expects NVIDIA’s CUDA binding and relaxed compatibility checks at import time: NUMBA_CUDA_USE_NVIDIA_BINDING=1 and STRICT_NUMBA_COMPAT_CHECK=0. Setting them after Numba has already been imported in a notebook is too late; see the sketch after this list. (NVIDIA)
- Hybrid head always activating RNNT code paths. Hybrid Transducer-CTC models carry both heads. Training and some eval flows call the RNNT loss even when you try to down-weight it; your issue shows rnnt_loss_gpu still running. NeMo’s own hybrid configs advise disabling the transducer eval loss for stability and memory. (GitHub)
- Torch/Torchaudio or RNNT backend mismatches. If you swap losses or builds, the torchaudio and torch CUDA builds must match; otherwise you trade one linker problem for another. (PyTorch Documentation)
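For the env-flag cause, ordering matters as much as the values. A minimal sketch of the safe pattern in a fresh Python process (the flag names come from the NVIDIA post; nothing else here is NeMo-specific):
import os

# These must be set before numba, or anything that imports numba (e.g. NeMo), is loaded.
os.environ["NUMBA_CUDA_USE_NVIDIA_BINDING"] = "1"
os.environ["STRICT_NUMBA_COMPAT_CHECK"] = "0"

import numba.cuda  # safe to import only after the flags are in place
print(numba.cuda.is_available())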
Solutions that work
1) Stand up a known-good RNNT+Numba environment
Pin to NVIDIA’s documented FP16 RNNT stack. This avoids most segfaults.
- Python 3.10, PyTorch built for cu118, cudatoolkit=11.8, cuda-python=11.8, numba=0.57.1.
- Install NeMo.
- Export env flags before Python starts.
- Verify with NeMo’s probe.
# Versions and rationale:
# NVIDIA FP16 RNNT + Numba guide → exact pins and checks
# https://research.nvidia.com/labs/conv-ai/blogs/2023/2023-10-28-numba-fp16/
conda create -n nemo-fa python=3.10 -y
conda activate nemo-fa
conda install -c pytorch -c nvidia -c conda-forge \
pytorch torchvision torchaudio pytorch-cuda=11.8 \
cudatoolkit=11.8 cuda-python=11.8 numba=0.57.1 cython -y
pip install "nemo_toolkit[all]>=1.20.0"
# Set before launching Python
export NUMBA_CUDA_USE_NVIDIA_BINDING=1
export STRICT_NUMBA_COMPAT_CHECK=0
python - <<'PY'
# Probe from the NVIDIA post
from nemo.core.utils import numba_utils
print(numba_utils.numba_cuda_is_supported(numba_utils.__NUMBA_MINIMUM_VERSION_FP16_SUPPORTED__))
print(numba_utils.is_numba_cuda_fp16_supported())
PY
If either check prints False, fix linkage first. Do not proceed. (NVIDIA)
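Before rerunning training, it also helps to print the pieces that must agree. A small diagnostic, assuming only that torch and numba are installed (no NeMo import needed):
import torch
import numba
from numba import cuda

print("torch:", torch.__version__, "| torch CUDA build:", torch.version.cuda)
print("numba:", numba.__version__, "| numba sees a CUDA device:", cuda.is_available())
# All of these should match the pinned stack above (cu118 build, numba 0.57.1)
# before you expect the NeMo FP16 probe to pass.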
2) If you must stay on CUDA 12
Enable Minor Version Compatibility so Numba can JIT-link kernels:
# Numba MVC docs
# https://numba.readthedocs.io/en/stable/cuda/minor_version_compatibility.html
pip install --extra-index-url https://pypi.nvidia.com pynvjitlink-cu12
export NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY=1
Restart the Python process to ensure Numba picks up the linker. (Numba Official Document)
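To confirm MVC is actually active in the new process, a quick check (a sketch; it assumes the pynvjitlink-cu12 wheel exposes the pynvjitlink module, which is the usual layout):
import os

# The flag must be visible to the process that imports numba, not exported afterwards.
assert os.environ.get("NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY") == "1"

import pynvjitlink  # ImportError here means the CUDA 12 linker shim is missing
import numba.cuda
print("numba CUDA available with MVC:", numba.cuda.is_available())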
3) Use a vetted NeMo container to isolate variables
Run inside the official NGC NeMo container and use the hybrid recipe from examples/asr to confirm your data and config are fine:
# NGC NeMo container entry
# https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models (container catalog linked from NeMo pages)
docker run --gpus all -it --rm nvcr.io/nvidia/nemo:latest
# inside container
python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \
model.train_ds.manifest_filepath=/data/train.json \
model.validation_ds.manifest_filepath=/data/val.json \
trainer.precision=16 \
model.compute_eval_loss=false # avoid transducer eval loss
The hybrid YAML in NeMo sets compute_eval_loss: false by default for long eval audio. Keep it false while you stabilize training. (GitHub)
4) Reduce RNNT surface area immediately
Two switches that stop many first-step failures (a Python sketch follows the list):
- Disable the RNNT eval loss: model.compute_eval_loss=false. Hybrid configs ship with this guidance. (GitHub)
- Keep validation on WER only. Do not compute the RNNT loss on eval batches; this avoids extra JIT kernels. (GitHub)
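If you drive fine-tuning from Python rather than the example script, the same switch can be flipped on the restored model. A sketch, assuming the from_pretrained name from the model card and that your NeMo version exposes compute_eval_loss on the hybrid class (verify before relying on it):
import nemo.collections.asr as nemo_asr

# Restore the hybrid checkpoint by the name given on its Hugging Face model card.
model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_fa_fastconformer_hybrid_large"
)

# Skip the transducer loss on validation batches; WER is still reported from decoding.
model.compute_eval_loss = False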
5) Bypass the Numba RNNT path but keep fine-tuning
Two pragmatic routes when RNNT remains unstable in your stack.
A) Swap to torchaudio’s CUDA RNNT loss.
Keep the hybrid architecture but replace the loss backend. Match torchaudio to your torch CUDA build.
# torchaudio RNNT loss (CUDA when torchaudio==torch CUDA)
# https://docs.pytorch.org/audio/main/generated/torchaudio.transforms.RNNTLoss.html
from torchaudio.transforms import RNNTLoss
# blank=-1 selects the last class index as the blank token (negative indexing)
loss_fn = RNNTLoss(blank=-1, fused_log_softmax=True).to("cuda")
# integrate into training_step to replace NeMo's Numba RNNT loss
This removes Numba and NVJitLink from the loop. (PyTorch Documentation)
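A self-contained smoke test of the swapped-in backend with dummy tensors (sizes are illustrative; B=batch, T=encoder frames, U=target length, V=vocab size including blank):
import torch
from torchaudio.transforms import RNNTLoss

B, T, U, V = 2, 50, 10, 29
logits = torch.randn(B, T, U + 1, V, device="cuda")                           # joint network output
targets = torch.randint(0, V - 1, (B, U), dtype=torch.int32, device="cuda")   # labels, blank excluded
logit_lengths = torch.full((B,), T, dtype=torch.int32, device="cuda")
target_lengths = torch.full((B,), U, dtype=torch.int32, device="cuda")

loss_fn = RNNTLoss(blank=-1, fused_log_softmax=True).to("cuda")
loss = loss_fn(logits, targets, logit_lengths, target_lengths)
print(loss.item())  # a finite number means the CUDA RNNT backend links and runs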
B) Train CTC-only using the same encoder.
Convert the hybrid checkpoint to a pure CTC .nemo and fine-tune CTC:
# Script present in NeMo examples/helpers mirrors
# https://.../examples/asr/asr_hybrid_transducer_ctc/helpers/convert_nemo_asr_hybrid_to_ctc.py
python convert_nemo_asr_hybrid_to_ctc.py \
-i stt_fa_fastconformer_hybrid_large.nemo \
-o stt_fa_fastconformer_ctc_only.nemo \
-m ctc
This path is widely used for deployment and avoids RNNT kernels entirely while retaining the FastConformer encoder. (mirrors.sustech.edu.cn)
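Once converted, the CTC-only checkpoint restores like any other NeMo ASR model and never touches the Numba RNNT kernels. A sketch with placeholder paths:
import nemo.collections.asr as nemo_asr

# Restore the converted CTC-only checkpoint produced above.
ctc_model = nemo_asr.models.ASRModel.restore_from("stt_fa_fastconformer_ctc_only.nemo")

# Smoke test: CTC transcription exercises only the encoder and CTC decoder.
print(ctc_model.transcribe(["/data/sample.wav"]))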
Model-specific context
- Target: nvidia/stt_fa_fastconformer_hybrid_large. FastConformer encoder with hybrid Transducer+CTC decoders (~115M parameters). Designed for fine-tuning; default decoding is RNNT, but CTC is first-class and documented. (Hugging Face)
- Your logs show the RNNT loss invoked at step 1 despite a low batch size and a clean dataloader. That narrows the problem to RNNT JIT/linkage. (GitHub)
Pitfalls checklist
- Do not set env vars inside a running notebook after Numba import. Start a clean process with flags set. (NVIDIA)
- If you install pynvjitlink but forget NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY=1, MVC stays off. Set both. (Numba Official Document)
- If you switch to torchaudio RNNT, ensure torchaudio and torch are the same CUDA build; otherwise you can hit a different linker error. A quick check follows this list. (PyTorch Documentation)
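For the last pitfall, a quick cross-check (a sketch; pip wheels from the CUDA index usually carry the build tag in the version string, conda builds may not):
import torch
import torchaudio

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("torchaudio:", torchaudio.__version__)
# Mismatched tags (e.g. +cu118 vs +cu121) mean torchaudio came from a different index
# than torch; reinstall it from the same index before using its RNNT loss.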
References
- NVIDIA: training NeMo RNNT with Numba FP16. Versions, env flags, probe. (NVIDIA)
- Numba MVC for CUDA 12: pynvjitlink and env flag. (Numba Official Document)
- Farsi hybrid model card. Architecture, hybrid intent, decoding modes. (Hugging Face)
- Hybrid YAML with compute_eval_loss: false. Practical stability tip. (GitHub)
- Torchaudio RNNTLoss docs. Alternative backend. (PyTorch Documentation)