[HELP] RuntimeError: CUDA error: device-side assert triggered

Hello,

I am following this tutorial on how to train my language model from scratch: notebooks/language_modeling_from_scratch.ipynb at master · huggingface/notebooks · GitHub

However, when I pass everything to my trainer:

from transformers import DataCollatorForLanguageModeling, Trainer

# Randomly masks 15% of the tokens for masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)
trainer.train()

I get this error:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

My advice is always: if you have a CUDA error, run your code on CPU and check if you’re getting a more helpful error message.
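For example, a minimal sketch of that for this thread’s setup (assuming the model, data_collator, and lm_datasets objects from the notebook): move the model to the CPU and push one small batch through it, so the real Python exception surfaces instead of the asynchronous CUDA assert.

import torch

# Move the model to the CPU and run a single small batch to surface the real exception
model_cpu = model.to("cpu")
batch = data_collator([lm_datasets["train"][i] for i in range(8)])
with torch.no_grad():
    outputs = model_cpu(**batch)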

OK, thanks!

How do you run on the CPU in Google Colab?

By clicking “Runtime” at the top => “Change runtime type” => set the hardware accelerator to “None”.
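Alternatively, you can keep the GPU runtime and force the Trainer itself onto the CPU. A minimal sketch, assuming your transformers version still accepts the no_cuda flag in TrainingArguments (the output directory is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test-mlm",  # placeholder output directory
    no_cuda=True,           # train on the CPU even if a GPU is available
)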

Whenever I get CUDA errors, it is always either a label mismatch (the model has X classes but the dataset contains instances labeled X+1, for example) or a tokenizer-model mismatch (a “bert-base-uncased” tokenizer with a “roberta-base” model, for example). Other than that, running the code on the CPU is the way to go to increase debuggability, as suggested.
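Two quick sanity checks along those lines, as a sketch (dataset, model, and the column names are illustrative and assume a tokenized Hugging Face dataset):

# 1) Label mismatch: the largest label ID must be at most num_labels - 1
max_label = max(dataset["train"]["label"])
print(max_label, model.config.num_labels)

# 2) Tokenizer-model mismatch: every input ID must fit in the embedding matrix
max_input_id = max(max(ids) for ids in dataset["train"]["input_ids"])
print(max_input_id, model.config.vocab_size)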

Hi, @nielsr and @ehalit - I tried it on the CPU, and this is now the error:

IndexError: index out of range in self

Can you provide the entire error message? The error probably happens in an embedding layer.

***** Running training *****
  Num examples = 111530
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 41826

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-28-3435b262f1ae> in <module>()
----> 1 trainer.train()

11 frames

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1284                         tr_loss += self.training_step(model, inputs)
   1285                 else:
-> 1286                     tr_loss += self.training_step(model, inputs)
   1287                 self.current_flos += float(self.floating_point_ops(inputs))
   1288 

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in training_step(self, model, inputs)
   1777                 loss = self.compute_loss(model, inputs)
   1778         else:
-> 1779             loss = self.compute_loss(model, inputs)
   1780 
   1781         if self.args.n_gpu > 1:

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   1809         else:
   1810             labels = None
-> 1811         outputs = model(**inputs)
   1812         # Save past state if it exists
   1813         # TODO: this needs to be fixed and made cleaner later.

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, labels, output_attentions, output_hidden_states, return_dict)
   1338             output_attentions=output_attentions,
   1339             output_hidden_states=output_hidden_states,
-> 1340             return_dict=return_dict,
   1341         )
   1342 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    987             token_type_ids=token_type_ids,
    988             inputs_embeds=inputs_embeds,
--> 989             past_key_values_length=past_key_values_length,
    990         )
    991         encoder_outputs = self.encoder(

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
    213 
    214         if inputs_embeds is None:
--> 215             inputs_embeds = self.word_embeddings(input_ids)
    216         token_type_embeddings = self.token_type_embeddings(token_type_ids)
    217 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
    158         return F.embedding(
    159             input, self.weight, self.padding_idx, self.max_norm,
--> 160             self.norm_type, self.scale_grad_by_freq, self.sparse)
    161 
    162     def extra_repr(self) -> str:

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2041         # remove once script supports set_grad_enabled
   2042         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2044 
   2045 

IndexError: index out of range in self

I have basically trained a tokenizer from scratch for a masked language modeling task, but when I actually train my language model, I get this error. When training my tokenizer, do I need [SEP] and [CLS] as special tokens?

Since you are getting an index-out-of-range error inside the embedding layer, there indeed seems to be a mismatch between the token IDs produced by the tokenizer and the size of the model’s embedding matrix. Since you are working on a masked language modeling task, this is probably caused by the tokenizer, which means you are facing both of the problems I mentioned.

Unfortunately, I have no experience with custom tokenizers, but if the model architecture needs special tokens (for example, BERT always needs [MASK] for MLM and [SEP] for NSP), I believe you will have to include them in the vocabulary of your newly trained tokenizer.
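For reference, a rough sketch of how the special tokens can be included when training a WordPiece tokenizer with the tokenizers library (the corpus file and vocabulary size are placeholders):

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()
tokenizer.train(
    files=["corpus.txt"],  # hypothetical training corpus
    vocab_size=30522,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],  # BERT's special tokens
)
tokenizer.save_model(".")  # writes vocab.txt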

Hi - I have actually included the [MASK] token in my vocabulary, so I am quite unsure what is causing the problem.

@nielsr Is there a way to increase the number of target classes?

@anon58275033 @nielsr @ehalit I am facing the same issue with BERT while training with the Trainer.
It gives me the same error:
"RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions."

Can you please help me find a solution?

This error typically happens when the number of classes in the model’s output layer and the number of classes in the training data do not match. Say you initialized a SequenceClassification model with num_labels = 10 but the data has 11 or more classes; you will get this error.

In text generation scenarios, this happens when you use a different tokenizer than the one used for the pretrained model. The mechanics are the same: the vocabulary size of the model and that of the tokenizer do not match.
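A quick way to check (and fix) the second case, as a sketch assuming a transformers model and tokenizer are already loaded as model and tokenizer:

# The tokenizer's vocabulary must fit inside the model's embedding matrix
print(len(tokenizer), model.config.vocab_size)

# If the tokenizer has more tokens than the model expects (e.g. after adding tokens
# or training a new tokenizer), resize the embedding matrix to match
model.resize_token_embeddings(len(tokenizer))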

Thank you @ehalit

This helped me with one of the issues. Thank you.
