Triaging cudaErrorIllegalAddress Error

I’m trying to fix a weird CUDA memory error that pops up seemingly at random during training. This is the stack trace:

  File "train.py", line 99, in <module>
    trainer.train()
  File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/transformers/trainer.py", line 1527, in train
    return inner_training_loop(
  File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/accelerate/utils/memory.py", line 79, in decorator
    return function(batch_size, *args, **kwargs)
  File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/transformers/trainer.py", line 1775, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/transformers/trainer.py", line 2533, in training_step
    self.scaler.scale(loss).backward()
  File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: unique_by_key: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

The error is always unique_by_key: failed to synchronize: cudaErrorIllegalAddress, and it always occurs during the .backward() call of the training loop. This is my training code; the processed dataset contains just input_ids and attention_mask, nothing else.

import datasets
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

dataset = datasets.load_from_disk('processed_data.hf')

tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer", max_len=160, use_fast=True)

config = RobertaConfig(
    vocab_size=8248,
    max_position_embeddings=256,
    num_attention_heads=8,
    num_hidden_layers=6,
    type_vocab_size=1)

model = RobertaForMaskedLM(config=config)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

bs = 256
accum = 4

training_args = TrainingArguments(
    output_dir="./checkpoints",
    overwrite_output_dir=True,
    num_train_epochs=1,
    auto_find_batch_size=True,
    per_device_train_batch_size=bs,
    save_steps=500,
    save_total_limit=4,
    prediction_loss_only=True,
    report_to='none',
    gradient_accumulation_steps=accum,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()

My initial thought was that this was a memory issue, so I lowered my batch size to the point where I’m only using about 50% of GPU memory during training - the issue persists.

Then I thought a few very long training examples might be causing memory issues when they found their way into the dataloader, so I removed the longest 25% of examples - the issue persists.
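
To illustrate what I mean, a rough sketch of that kind of length filter (using datasets.Dataset.filter and measuring length on input_ids; the exact cutoff isn’t important):

import numpy as np

# Sketch: drop the longest 25% of examples, measured by tokenized length
lengths = [len(ids) for ids in dataset["input_ids"]]
cutoff = np.percentile(lengths, 75)
dataset = dataset.filter(lambda ex: len(ex["input_ids"]) <= cutoff)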

Does anyone know how I might go about finding the cause of this? I haven’t been able to reproduce the issue on CPU, so I don’t have a more informative error message. Training progresses normally while using only 50% of GPU memory, then randomly goes off the rails.
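
Is rerunning with CUDA_LAUNCH_BLOCKING=1 the right way to get a more useful traceback? Something like this at the very top of train.py (a minimal sketch; it forces synchronous kernel launches, so it will slow training down):

import os

# Must be set before any CUDA work happens. Synchronous launches make the
# error surface at the op that actually faults instead of at a later
# synchronization point.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"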

Versions running:
PyTorch 1.11.0 (py3.8_cuda11.3_cudnn8.2.0_0)
CUDA 11.1
Transformers 4.25.1
Accelerate 0.12.0


I think this might be related to a NaN loss going into the FP16 scaler? ref

Not sure why the scaler wouldn’t catch that and skip the batch.

Edit: I caught a few NaN batches going into the self.scaler.scale(loss).backward() step, but I’ve since also seen the error triggered by normal loss values.
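
For anyone who wants to watch for the same thing, a minimal sketch (the NaNCheckTrainer name and the print are mine) that logs non-finite losses before they reach the scaler:

import torch
from transformers import Trainer

class NaNCheckTrainer(Trainer):
    # Hypothetical subclass: log non-finite losses before they reach
    # self.scaler.scale(loss).backward() in training_step.
    def compute_loss(self, model, inputs, return_outputs=False):
        out = super().compute_loss(model, inputs, return_outputs=return_outputs)
        loss = out[0] if return_outputs else out
        if not torch.isfinite(loss).all():
            print(f"non-finite loss detected: {loss}")
        return out

It can be used as a drop-in replacement for Trainer in the training script above.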

@entropy
I’m currently getting a similar error. Did you ever resolve this?