[HELP] RuntimeError: CUDA error: device-side assert triggered

Hello,

I am following this tutorial on how to train my language model from scratch: notebooks/language_modeling_from_scratch.ipynb at master · huggingface/notebooks · GitHub

However, when I pass everything to my trainer:

from transformers import DataCollatorForLanguageModeling, Trainer

# Randomly masks 15% of the tokens for masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)
trainer.train()

I get this error:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

My advice is always: if you have a CUDA error, run your code on CPU and check if you’re getting a more helpful error message.
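For example, a minimal sketch of that for this thread’s setup (assuming the model, data_collator, and lm_datasets objects from the notebook): move the model to the CPU and push one small batch through it, so the real Python exception surfaces instead of the asynchronous CUDA assert.

import torch

# Move the model to the CPU and run a single small batch to surface the real exception
model_cpu = model.to("cpu")
batch = data_collator([lm_datasets["train"][i] for i in range(8)])
with torch.no_grad():
    outputs = model_cpu(**batch)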

OK, thanks!

How do you run on the CPU in Google Colab?

By clicking “Runtime” at the top => “Change runtime type” => set the hardware accelerator to “None”.
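Alternatively, you can keep the GPU runtime and force the Trainer itself onto the CPU. A minimal sketch, assuming your transformers version still accepts the no_cuda flag in TrainingArguments (the output directory is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test-mlm",  # placeholder output directory
    no_cuda=True,           # train on the CPU even if a GPU is available
)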

Whenever I get CUDA errors, it is always either a label mismatch (the model has X classes but the dataset contains instances labeled X+1, for example) or a tokenizer-model mismatch (a “bert-base-uncased” tokenizer with a “roberta-base” model, for example). Other than that, running the code on the CPU is the way to go to increase debuggability, as suggested.
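Two quick sanity checks along those lines, as a sketch (dataset, model, and the column names are illustrative and assume a tokenized Hugging Face dataset):

# 1) Label mismatch: the largest label ID must be at most num_labels - 1
max_label = max(dataset["train"]["label"])
print(max_label, model.config.num_labels)

# 2) Tokenizer-model mismatch: every input ID must fit in the embedding matrix
max_input_id = max(max(ids) for ids in dataset["train"]["input_ids"])
print(max_input_id, model.config.vocab_size)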

Hi, @nielsr and @ehalit - I tried it on the CPU, and this is now the error:

IndexError: index out of range in self

Can you provide the entire error message? The error probably happens in an embedding layer.

***** Running training *****
  Num examples = 111530
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 41826

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-28-3435b262f1ae> in <module>()
----> 1 trainer.train()

11 frames

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1284                         tr_loss += self.training_step(model, inputs)
   1285                 else:
-> 1286                     tr_loss += self.training_step(model, inputs)
   1287                 self.current_flos += float(self.floating_point_ops(inputs))
   1288 

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in training_step(self, model, inputs)
   1777                 loss = self.compute_loss(model, inputs)
   1778         else:
-> 1779             loss = self.compute_loss(model, inputs)
   1780 
   1781         if self.args.n_gpu > 1:

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   1809         else:
   1810             labels = None
-> 1811         outputs = model(**inputs)
   1812         # Save past state if it exists
   1813         # TODO: this needs to be fixed and made cleaner later.

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, labels, output_attentions, output_hidden_states, return_dict)
   1338             output_attentions=output_attentions,
   1339             output_hidden_states=output_hidden_states,
-> 1340             return_dict=return_dict,
   1341         )
   1342 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    987             token_type_ids=token_type_ids,
    988             inputs_embeds=inputs_embeds,
--> 989             past_key_values_length=past_key_values_length,
    990         )
    991         encoder_outputs = self.encoder(

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
    213 
    214         if inputs_embeds is None:
--> 215             inputs_embeds = self.word_embeddings(input_ids)
    216         token_type_embeddings = self.token_type_embeddings(token_type_ids)
    217 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
    158         return F.embedding(
    159             input, self.weight, self.padding_idx, self.max_norm,
--> 160             self.norm_type, self.scale_grad_by_freq, self.sparse)
    161 
    162     def extra_repr(self) -> str:

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2041         # remove once script supports set_grad_enabled
   2042         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2044 
   2045 

IndexError: index out of range in self

I have basically trained a tokenizer from scratch for a masked language modeling task, but when I actually train my language model, I get this error. When training my tokenizer, do I need [SEP] and [CLS] as special tokens?

Since you are getting an index-out-of-range error inside the embedding layer, there indeed seems to be a mismatch between the token IDs produced by the tokenizer and the size of the model’s embedding matrix. Since you are working on a masked language modeling task, this is probably caused by the tokenizer, which means you are facing both of the problems I mentioned.

Unfortunately, I have no experience with custom tokenizers, but if the model architecture needs special tokens (for example, BERT always needs [MASK] for MLM and [SEP] for NSP), I believe you will have to include them in the vocabulary of your newly trained tokenizer.
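For reference, a rough sketch of how the special tokens can be included when training a WordPiece tokenizer with the tokenizers library (the corpus file and vocabulary size are placeholders):

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()
tokenizer.train(
    files=["corpus.txt"],  # hypothetical training corpus
    vocab_size=30522,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],  # BERT's special tokens
)
tokenizer.save_model(".")  # writes vocab.txt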

Hi - I have actually included the [MASK] token in my vocabulary, so I am quite unsure what is causing the problem.

@nielsr Is there a way to increase the number of target classes?

@anon58275033 @nielsr @ehalit I am facing the same issue with BERT while training with the Trainer.
It gives me the same error:
"RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions."

Can you please help me find a solution?

This error typically happens when the number of classes in the model’s output layer and the number of classes in the training data do not match. Say you initialized a SequenceClassification model with num_labels = 10 but the data has 11 or more classes; you will get this error.

In text generation scenarios, this happens when you use a different tokenizer than the one used for the pretrained model. The mechanics are the same: the vocabulary size of the model and that of the tokenizer do not match.
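A quick way to check (and fix) the second case, as a sketch assuming a transformers model and tokenizer are already loaded as model and tokenizer:

# The tokenizer's vocabulary must fit inside the model's embedding matrix
print(len(tokenizer), model.config.vocab_size)

# If the tokenizer has more tokens than the model expects (e.g. after adding tokens
# or training a new tokenizer), resize the embedding matrix to match
model.resize_token_embeddings(len(tokenizer))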

Thank you @ehalit

This helped me with one of the issues. Thank you.
