I’m trying to track down a CUDA memory error that pops up seemingly at random during training. This is the stack trace:
File "train.py", line 99, in <module>
trainer.train()
File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/transformers/trainer.py", line 1527, in train
return inner_training_loop(
File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/accelerate/utils/memory.py", line 79, in decorator
return function(batch_size, *args, **kwargs)
File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/transformers/trainer.py", line 1775, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/transformers/trainer.py", line 2533, in training_step
self.scaler.scale(loss).backward()
File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/dmai/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: unique_by_key: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
The error is always unique_by_key: failed to synchronize: cudaErrorIllegalAddress, and it always occurs during the .backward() call of the training loop. This is my training code. The processed dataset contains just input_ids and attention_mask, nothing else.
import datasets
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Tokenized dataset containing only input_ids and attention_mask
dataset = datasets.load_from_disk('processed_data.hf')
tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer", max_len=160, use_fast=True)

# Small RoBERTa trained from scratch for masked language modeling
config = RobertaConfig(
    vocab_size=8248,
    max_position_embeddings=256,
    num_attention_heads=8,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

# Dynamic MLM masking at collation time
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

bs = 256
accum = 4
training_args = TrainingArguments(
    output_dir="./checkpoints",
    overwrite_output_dir=True,
    num_train_epochs=1,
    auto_find_batch_size=True,
    per_device_train_batch_size=bs,
    save_steps=500,
    save_total_limit=4,
    prediction_loss_only=True,
    report_to='none',
    gradient_accumulation_steps=accum,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
My initial thought was that this was a memory issue, so I lowered the batch size to the point where training only uses about 50% of GPU memory. The issue persists.
Then I thought a few very long training examples might be causing memory issues when they found their way into the dataloader, so I removed the longest 25% of examples (roughly as sketched below). The issue still persists.
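For reference, the length filtering looked something like this (just a sketch, not the exact script; the quantile computation here is my reconstruction):

# Drop the longest 25% of examples by token count (sketch)
import numpy as np

lengths = [len(ids) for ids in dataset["input_ids"]]
cutoff = np.quantile(lengths, 0.75)
dataset = dataset.filter(lambda ex: len(ex["input_ids"]) <= cutoff)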
Does anyone know how I might go about finding the cause of this? I haven’t been able to recreate the issue on CPU (attempted roughly as sketched below), so I don’t have a more informative error message. Training progresses normally using only about 50% of GPU memory, then randomly goes off the rails.
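The CPU attempt was essentially the same script rerun with CUDA and mixed precision turned off, along these lines (a sketch reusing the objects from the script above; the output_dir name is just a placeholder, and the relevant changes are no_cuda=True and fp16=False):

# Sketch of the CPU reproduction attempt: same model, collator, and dataset,
# but force CPU and disable fp16 (mixed precision requires a GPU).
cpu_args = TrainingArguments(
    output_dir="./checkpoints_cpu",   # placeholder path
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=bs,
    gradient_accumulation_steps=accum,
    prediction_loss_only=True,
    report_to='none',
    no_cuda=True,   # run on CPU
    fp16=False,     # fp16 needs CUDA
)
Trainer(
    model=model,
    args=cpu_args,
    data_collator=data_collator,
    train_dataset=dataset,
).train()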
Versions I’m running:
PyTorch 1.11.0 (py3.8_cuda11.3_cudnn8.2.0_0)
CUDA 11.1
Transformers 4.25.1
Accelerate 0.12.0