CUDA error: device-side assert triggered after a certain number of steps

Hi, I am trying to train a zero-shot topic classification model on the XNLI/vi dataset using phobert-base-v2 on Google Colab Pro+.

I get the following error whenever training reaches a certain number of steps, regardless of batch_size:

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

With batch_size = 16 it gets to around 200 steps; with batch_size = 8 it gets to around 600 steps.

I have read through several posts suggesting the problem might come from label indexing. I checked the labels column in the dataset; the values run from 0 to 2. I also tried switching to CPU, but it is very slow, and given that the bug only appears after training has run for a while, label indexing might not be the real issue.
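For reference, this is roughly how I checked (a minimal sketch; train_dataset and model are the same objects passed to the Trainer below, and I'm assuming the label column is named "labels"):

import numpy as np

# Sanity check: every label id must lie in [0, num_labels).
labels = np.array(train_dataset["labels"])  # assumption: the column is named "labels"
print("label range:", labels.min(), "to", labels.max())
print("num_labels: ", model.config.num_labels)
assert labels.min() >= 0 and labels.max() < model.config.num_labels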

This is my training code:

from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback, IntervalStrategy
import numpy as np
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

def compute_metrics(eval_preds):

    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    precision = precision_metric.compute(predictions=predictions, references=labels, average="macro")["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels, average="macro")["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")["f1"]

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

training_args = TrainingArguments(
    output_dir='./zero_shot_topic_classification',
    evaluation_strategy=IntervalStrategy.STEPS,
    eval_steps=100,
    save_steps=200,
    logging_steps=100,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=50,
    weight_decay=0.01,
    save_strategy=IntervalStrategy.STEPS,
    push_to_hub=False,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    optim="adamw_torch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)

trainer.train()

This is with batch_size 8; the error appears at step 636.

I am going through the same problem. I tried setting the environment variable with os.environ['CUDA_LAUNCH_BLOCKING'] = "1", but it didn't help. I am fine-tuning a BERT model on my own dataset for intent classification.

RuntimeError                              Traceback (most recent call last)
<ipython-input-20-fa5ddd935c58> in <cell line: 4>()
      2 import os
      3 os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
----> 4 trainer.train()

4 frames
/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py in empty_cache()
    131     """
    132     if is_initialized():
--> 133         torch._C._cuda_emptyCache()
    134 
    135 

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
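
I later realized (this is my understanding of CUDA initialization, not anything transformers-specific) that CUDA_LAUNCH_BLOCKING is only read when CUDA is first initialized, so setting it after torch has already touched the GPU does nothing. The ordering that should take effect in Colab, after a runtime restart, is roughly:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # CUDA is initialized lazily, only after this import
from transformers import Trainer  # ...rest of the training code follows

Even then, the variable only makes kernel launches synchronous so the stack trace points at the failing op; it doesn't fix the underlying assert.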

Also, I tried running it on CPU on Colab, but there I was getting IndexError: Target 2 is out of bounds.
Basically I am stuck in a loop of three errors :tired_face:
Here's my code, could anyone help me with this?

from transformers import AutoTokenizer, AutoModelForSequenceClassification, BertForSequenceClassification
# cartesinus/xlm-r-base-amazon-massive-intent-label_smoothing
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity")

# Define training arguments
from transformers import Trainer, TrainingArguments
from torch import nn


training_args = TrainingArguments(
    per_device_train_batch_size=1,
    output_dir='./results',  # Directory where checkpoints and logs will be saved
    num_train_epochs=3,
    evaluation_strategy="steps",
    eval_steps=100,  # Number of steps before evaluating on the validation set
    save_steps=100,  # Number of steps before saving a checkpoint
    load_best_model_at_end=True,
    push_to_hub=False,  # Set to True if you want to push the model to the Hugging Face Model Hub
)


from transformers import EvalPrediction

def custom_accuracy(p: EvalPrediction):
    # Extract predictions and label_ids
    predictions = p.predictions.argmax(axis=1)
    label_ids = p.label_ids

    # Calculate accuracy
    correct = (predictions == label_ids).sum()
    total = len(predictions)
    accuracy = correct / total

    # Return accuracy as a dictionary
    return {"accuracy": accuracy}




# Define a metric for evaluation (e.g., accuracy); load_metric needs to be imported
from datasets import load_metric
metric = load_metric("accuracy")  # note: unused below -- the Trainer uses custom_accuracy instead

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=None,  # You can use your own data collator if needed
    compute_metrics=custom_accuracy
)

# Fine-tune the model on your dataset
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
trainer.train()

And here's what my dataset looks like, i.e. the training data:

Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 492
})

If you got the IndexError, you can try investigating the label indexing in your data. The out-of-bounds problem normally comes from that.
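
A quick check might look like this (a rough sketch; dataset and model stand in for your own objects, and I'm assuming the label column is named "label"):

# Compare the labels actually present in the data with what the model's head expects.
print(sorted(set(dataset["label"])))  # e.g. [0, 1, 2]
print(model.config.num_labels)        # must be strictly greater than the largest label id
print(model.config.id2label)          # the label mapping the checkpoint was saved with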

Yeah, but I think my labels are fine; they start from zero, as they should for a BERT model. Here, have a look at my dataset as a dataframe object.

I think the problem lies in how num_labels is assigned to the model from the model's config file.
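Concretely, something like this (a sketch of what I mean, assuming my three intent classes):

from transformers import AutoModelForSequenceClassification

# The yelp-polarity checkpoint ships with a 2-label head, so label id 2 is out of
# bounds for it. Reloading with num_labels=3 makes transformers discard the saved
# head and initialize a fresh 3-label one.
model = AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-yelp-polarity",
    num_labels=3,                  # assumption: three intent classes
    ignore_mismatched_sizes=True,  # replace the saved 2-label head
)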
And now I am getting a new error:

ValueError: Target size (torch.Size([1])) must be the same as input size (torch.Size([1, 2]))
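
Update: if I understand this second error correctly, transformers infers the problem type from the label dtype, and float labels push BertForSequenceClassification onto the BCEWithLogitsLoss path, which expects one target per logit. Casting the labels to integers should restore ordinary cross-entropy. A sketch, assuming the datasets.Dataset objects from above:

from datasets import Value

# Hypothetical fix: int64 labels make the model pick single-label
# classification (CrossEntropyLoss) instead of BCEWithLogitsLoss.
train_dataset = train_dataset.cast_column("label", Value("int64"))
validation_dataset = validation_dataset.cast_column("label", Value("int64"))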

I am getting the same error:

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Same case here: batch size 2, failing at step 32 while pretraining a LLaMA model.
Did anyone find a solution?

I was able to get past this error by setting PYTORCH_USE_CUDA_DSA to "1":

os.environ["PYTORCH_USE_CUDA_DSA"] = "1"

My torch version is 2.2.0.
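
Same caveat as with CUDA_LAUNCH_BLOCKING earlier in the thread, though: as far as I can tell the variable is only read when CUDA is initialized, so it has to be set before torch is imported, e.g. in the first cell after a runtime restart:

import os
os.environ["PYTORCH_USE_CUDA_DSA"] = "1"  # set before importing torch

import torch  # CUDA is initialized only after this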