I think this might be related to a NaN loss going into the FP16 scaler? ref
I'm not sure why the scaler wouldn't catch that and skip the batch.
edit: I caught a few NaN batches going into the self.scaler.scale(loss).backward()
step, but I've since also seen the error triggered by batches with normal loss values.
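For what it's worth, here is a minimal sketch of the kind of guard I mean (names like model, optimizer, criterion, and loader are placeholders for whatever the real loop uses). As far as I understand, GradScaler only checks the gradients inside scaler.step() for infs/NaNs and skips the optimizer step then; it never looks at the loss value itself, so a NaN loss has to be caught manually before backward():

import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    optimizer.zero_grad(set_to_none=True)

    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # The scaler never inspects the loss, only the gradients during
    # scaler.step(), so skip the batch here to avoid backpropagating NaNs.
    if not torch.isfinite(loss):
        print("non-finite loss, skipping batch:", loss.item())
        continue

    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if grads contain inf/NaN
    scaler.update()

This catches the NaN-loss case, but it obviously doesn't explain the failures I'm seeing on batches where the loss looks normal.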