CUDA error: device-side assert triggered after a certain number of steps

I am running into the same problem.
I tried setting the environment variable with os.environ['CUDA_LAUNCH_BLOCKING'] = "1", but it didn't help.
I am fine-tuning a BERT model on my own dataset for intent classification.
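
As far as I understand, CUDA_LAUNCH_BLOCKING only takes effect if it is set before CUDA is initialized, so setting it in the same cell as trainer.train() (as in the traceback below) may already be too late; please correct me if that is wrong. This is the kind of placement I mean, at the very top of the notebook:

import os

# Set before importing torch / touching the GPU, otherwise the variable
# may be ignored (my understanding, happy to be corrected)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch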

RuntimeError                              Traceback (most recent call last)
<ipython-input-20-fa5ddd935c58> in <cell line: 4>()
      2 import os
      3 os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
----> 4 trainer.train()

4 frames
/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py in empty_cache()
    131     """
    132     if is_initialized():
--> 133         torch._C._cuda_emptyCache()
    134 
    135 

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Also, when I tried running it on CPU on Colab, I instead got IndexError: Target 2 is out of bounds.
So basically I am stuck in a loop of three errors :tired_face:
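
The IndexError makes me suspect that my dataset has more label classes than the model head expects (the Yelp polarity checkpoint I load below is, as far as I know, a 2-label model). A quick check along these lines (assuming train_dataset is a datasets.Dataset with a "label" column, as shown at the end of this post) should confirm that:

# Compare the label values in the dataset with the size of the classifier head
print(sorted(set(train_dataset["label"])))  # e.g. [0, 1, 2] would explain "Target 2"
print(model.config.num_labels)              # 2 for the Yelp polarity checkpoint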
Here's my code; could anyone help me with this?

from transformers import AutoTokenizer, AutoModelForSequenceClassification, BertForSequenceClassification
# cartesinus/xlm-r-base-amazon-massive-intent-label_smoothing
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity")

# Define training arguments
from transformers import Trainer, TrainingArguments
from torch import nn


training_args = TrainingArguments(
    per_device_train_batch_size=1,
    output_dir='./results',  # Directory where checkpoints and logs will be saved
    num_train_epochs=3,
    evaluation_strategy="steps",
    eval_steps=100,  # Number of steps before evaluating on the validation set
    save_steps=100,  # Number of steps before saving a checkpoint
    load_best_model_at_end=True,
    push_to_hub=False,  # Set to True if you want to push the model to the Hugging Face Model Hub
)


from transformers import EvalPrediction

def custom_accuracy(p: EvalPrediction):
    # Extract predictions and label_ids
    predictions = p.predictions.argmax(axis=1)
    label_ids = p.label_ids

    # Calculate accuracy
    correct = (predictions == label_ids).sum()
    total = len(predictions)
    accuracy = correct / total

    # Return accuracy as a dictionary
    return {"accuracy": accuracy}




# Define a metric for evaluation (e.g., accuracy)
from datasets import load_metric
metric = load_metric("accuracy")  # not actually passed to the Trainer; custom_accuracy is used instead

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=None,  # You can use your own data collator if needed
    compute_metrics=custom_accuracy
)

# Fine-tune the model on your dataset
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
trainer.train()

And here's what my dataset looks like, i.e. the training data:

Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 492
})
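
If the problem really is a label-count mismatch, would reloading the model with the correct number of labels be the right fix? Something like this is what I have in mind (num_labels here is just derived from my dataset; ignore_mismatched_sizes lets the 2-output classifier head be re-initialized):

# Hypothetical fix: re-initialize the classification head with the number
# of intent classes actually present in my dataset
num_labels = len(set(train_dataset["label"]))
model = AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-yelp-polarity",
    num_labels=num_labels,
    ignore_mismatched_sizes=True,  # the original head has 2 outputs
)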