Is this due to Numba-CUDA? Yes.
The first training step enters NeMo’s Numba-JIT RNNT loss (rnnt_loss_gpu) and the process crashes. Not OOM, not hardware: it is a JIT/linker mismatch between Numba-CUDA, your CUDA toolkit, and the NVIDIA driver. The model still exercises the RNNT path even when you try to down-weight it.
Causes
- Numba-CUDA JIT/linker incompatibility. RNNT in NeMo uses Numba-compiled CUDA kernels, and this path is sensitive to the exact Numba, CUDA toolkit, and driver combination; the first kernel launch often fails hard when linkage is off. NVIDIA documents the intended FP16 RNNT stack and flags. Your backtraces point into .../parts/numba/rnnt_loss/rnnt.py. (NVIDIA)
- CUDA 12 without Minor Version Compatibility (MVC). With CUDA 12 wheels, JIT linking relies on NVJitLink. If pynvjitlink is missing or MVC is disabled, Numba can segfault on the first launch even if plain PyTorch works. (Numba Official Document)
- Required env flags not set before import. NeMo’s RNNT+Numba path expects NVIDIA’s CUDA binding and relaxed compatibility checks at import time: NUMBA_CUDA_USE_NVIDIA_BINDING=1 and STRICT_NUMBA_COMPAT_CHECK=0. Setting them after Numba has already been imported in a notebook is too late; see the sketch after this list. (NVIDIA)
- Hybrid head always activating RNNT code paths. Hybrid Transducer-CTC models carry both heads. Training and some eval flows call the RNNT loss even when you try to down-weight it; your issue shows rnnt_loss_gpu still running. NeMo’s own hybrid configs advise disabling the transducer eval loss for stability and memory. (GitHub)
- Torch/Torchaudio or RNNT backend mismatches. If you swap losses or builds, the torchaudio and torch CUDA builds must match; otherwise you trade one linker problem for another. (PyTorch Documentation)
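For the env-flag cause, ordering matters as much as the values. A minimal sketch of the safe pattern in a fresh Python process (the flag names come from the NVIDIA post; nothing else here is NeMo-specific):
import os

# These must be set before numba, or anything that imports numba (e.g. NeMo), is loaded.
os.environ["NUMBA_CUDA_USE_NVIDIA_BINDING"] = "1"
os.environ["STRICT_NUMBA_COMPAT_CHECK"] = "0"

import numba.cuda  # safe to import only after the flags are in place
print(numba.cuda.is_available())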
Solutions that work
1) Stand up a known-good RNNT+Numba environment
Pin to NVIDIA’s documented FP16 RNNT stack. This avoids most segfaults.
- Python 3.10, PyTorch built for cu118, cudatoolkit=11.8, cuda-python=11.8, numba=0.57.1.
- Install NeMo.
- Export env flags before Python starts.
- Verify with NeMo’s probe.
# Versions and rationale:
# NVIDIA FP16 RNNT + Numba guide → exact pins and checks
# https://research.nvidia.com/labs/conv-ai/blogs/2023/2023-10-28-numba-fp16/
conda create -n nemo-fa python=3.10 -y
conda activate nemo-fa
conda install -c pytorch -c nvidia -c conda-forge \
pytorch torchvision torchaudio pytorch-cuda=11.8 \
cudatoolkit=11.8 cuda-python=11.8 numba=0.57.1 cython -y
pip install "nemo_toolkit[all]>=1.20.0"
# Set before launching Python
export NUMBA_CUDA_USE_NVIDIA_BINDING=1
export STRICT_NUMBA_COMPAT_CHECK=0
python - <<'PY'
# Probe from the NVIDIA post
from nemo.core.utils import numba_utils
print(numba_utils.numba_cuda_is_supported(numba_utils.__NUMBA_MINIMUM_VERSION_FP16_SUPPORTED__))
print(numba_utils.is_numba_cuda_fp16_supported())
PY
If either check prints False, fix linkage first. Do not proceed. (NVIDIA)
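Before rerunning training, it also helps to print the pieces that must agree. A small diagnostic, assuming only that torch and numba are installed (no NeMo import needed):
import torch
import numba
from numba import cuda

print("torch:", torch.__version__, "| torch CUDA build:", torch.version.cuda)
print("numba:", numba.__version__, "| numba sees a CUDA device:", cuda.is_available())
# All of these should match the pinned stack above (cu118 build, numba 0.57.1)
# before you expect the NeMo FP16 probe to pass.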
2) If you must stay on CUDA 12
Enable Minor Version Compatibility so Numba can JIT-link kernels:
# Numba MVC docs
# https://numba.readthedocs.io/en/stable/cuda/minor_version_compatibility.html
pip install --extra-index-url https://pypi.nvidia.com pynvjitlink-cu12
export NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY=1
Restart the Python process to ensure Numba picks up the linker. (Numba Official Document)
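To confirm MVC is actually active in the new process, a quick check (a sketch; it assumes the pynvjitlink-cu12 wheel exposes the pynvjitlink module, which is the usual layout):
import os

# The flag must be visible to the process that imports numba, not exported afterwards.
assert os.environ.get("NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY") == "1"

import pynvjitlink  # ImportError here means the CUDA 12 linker shim is missing
import numba.cuda
print("numba CUDA available with MVC:", numba.cuda.is_available())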
3) Use a vetted NeMo container to isolate variables
Run inside the official NGC NeMo container and use the hybrid recipe from examples/asr to confirm your data and config are fine:
# NGC NeMo container entry
# https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models (container catalog linked from NeMo pages)
docker run --gpus all -it --rm nvcr.io/nvidia/nemo:latest
# inside container
python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \
model.train_ds.manifest_filepath=/data/train.json \
model.validation_ds.manifest_filepath=/data/val.json \
trainer.precision=16 \
model.compute_eval_loss=false # avoid transducer eval loss
The hybrid YAML in NeMo sets compute_eval_loss: false by default for long eval audio. Keep it false while you stabilize training. (GitHub)
4) Reduce RNNT surface area immediately
Two switches that stop many first-step failures (a Python sketch follows the list):
- Disable the RNNT eval loss: model.compute_eval_loss=false. Hybrid configs ship with this guidance. (GitHub)
- Keep validation on WER only. Do not compute the RNNT loss on eval batches; this avoids extra JIT kernels. (GitHub)
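If you drive fine-tuning from Python rather than the example script, the same switch can be flipped on the restored model. A sketch, assuming the from_pretrained name from the model card and that your NeMo version exposes compute_eval_loss on the hybrid class (verify before relying on it):
import nemo.collections.asr as nemo_asr

# Restore the hybrid checkpoint by the name given on its Hugging Face model card.
model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_fa_fastconformer_hybrid_large"
)

# Skip the transducer loss on validation batches; WER is still reported from decoding.
model.compute_eval_loss = False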
5) Bypass the Numba RNNT path but keep fine-tuning
Two pragmatic routes when RNNT remains unstable in your stack.
A) Swap to torchaudio’s CUDA RNNT loss.
Keep the hybrid architecture but replace the loss backend. Match torchaudio to your torch CUDA build.
# torchaudio RNNT loss (CUDA when torchaudio==torch CUDA)
# https://docs.pytorch.org/audio/main/generated/torchaudio.transforms.RNNTLoss.html
from torchaudio.transforms import RNNTLoss
# blank=-1 selects the last class index as the blank token (negative indexing)
loss_fn = RNNTLoss(blank=-1, fused_log_softmax=True).to("cuda")
# integrate into training_step to replace NeMo's Numba RNNT loss
This removes Numba and NVJitLink from the loop. (PyTorch Documentation)
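A self-contained smoke test of the swapped-in backend with dummy tensors (sizes are illustrative; B=batch, T=encoder frames, U=target length, V=vocab size including blank):
import torch
from torchaudio.transforms import RNNTLoss

B, T, U, V = 2, 50, 10, 29
logits = torch.randn(B, T, U + 1, V, device="cuda")                           # joint network output
targets = torch.randint(0, V - 1, (B, U), dtype=torch.int32, device="cuda")   # labels, blank excluded
logit_lengths = torch.full((B,), T, dtype=torch.int32, device="cuda")
target_lengths = torch.full((B,), U, dtype=torch.int32, device="cuda")

loss_fn = RNNTLoss(blank=-1, fused_log_softmax=True).to("cuda")
loss = loss_fn(logits, targets, logit_lengths, target_lengths)
print(loss.item())  # a finite number means the CUDA RNNT backend links and runs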
B) Train CTC-only using the same encoder.
Convert the hybrid checkpoint to a pure CTC .nemo and fine-tune CTC:
# Script present in NeMo examples/helpers mirrors
# https://.../examples/asr/asr_hybrid_transducer_ctc/helpers/convert_nemo_asr_hybrid_to_ctc.py
python convert_nemo_asr_hybrid_to_ctc.py \
-i stt_fa_fastconformer_hybrid_large.nemo \
-o stt_fa_fastconformer_ctc_only.nemo \
-m ctc
This path is widely used for deployment and avoids RNNT kernels entirely while retaining the FastConformer encoder. (mirrors.sustech.edu.cn)
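Once converted, the CTC-only checkpoint restores like any other NeMo ASR model and never touches the Numba RNNT kernels. A sketch with placeholder paths:
import nemo.collections.asr as nemo_asr

# Restore the converted CTC-only checkpoint produced above.
ctc_model = nemo_asr.models.ASRModel.restore_from("stt_fa_fastconformer_ctc_only.nemo")

# Smoke test: CTC transcription exercises only the encoder and CTC decoder.
print(ctc_model.transcribe(["/data/sample.wav"]))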
Model-specific context
- Target: nvidia/stt_fa_fastconformer_hybrid_large. FastConformer encoder with hybrid Transducer+CTC decoders (~115M parameters). Designed for fine-tuning; default decoding is RNNT, but CTC is first-class and documented. (Hugging Face)
- Your logs show the RNNT loss invoked at step 1 despite a low batch size and a clean dataloader. That narrows the problem to RNNT JIT/linkage. (GitHub)
Pitfalls checklist
- Do not set env vars inside a running notebook after Numba import. Start a clean process with flags set. (NVIDIA)
- If you install pynvjitlink but forget NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY=1, MVC stays off. Set both. (Numba Official Document)
- If you switch to torchaudio RNNT, ensure torchaudio and torch are the same CUDA build; otherwise you can hit a different linker error. A quick check follows this list. (PyTorch Documentation)
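For the last pitfall, a quick cross-check (a sketch; pip wheels from the CUDA index usually carry the build tag in the version string, conda builds may not):
import torch
import torchaudio

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("torchaudio:", torchaudio.__version__)
# Mismatched tags (e.g. +cu118 vs +cu121) mean torchaudio came from a different index
# than torch; reinstall it from the same index before using its RNNT loss.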
References
- NVIDIA: training NeMo RNNT with Numba FP16. Versions, env flags, probe. (NVIDIA)
- Numba MVC for CUDA 12: pynvjitlink and env flag. (Numba Official Document)
- Farsi hybrid model card. Architecture, hybrid intent, decoding modes. (Hugging Face)
- Hybrid YAML with compute_eval_loss: false. Practical stability tip. (GitHub)
- Torchaudio RNNTLoss docs. Alternative backend. (PyTorch Documentation)