CUDA error: device-side assert triggered after a certain number of steps

Hi, I am trying to train a zero-shot topic classification model on the XNLI/vi dataset using phobert-base-v2 on Google Colab Pro+.

I get the following error whenever training reaches a certain number of steps, regardless of batch_size:

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

With batch_size = 16 it gets to around 200 steps; with batch_size = 8 it gets to around 600 steps.

I have read through several posts suggesting the problem might come from label indexing. I checked the labels column in the dataset; the values run from 0 to 2. I also tried switching to CPU, but it is very slow, and given that the bug only appears after training has run for a while, label indexing might not be the real issue.
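For reference, this is roughly how I checked (a minimal sketch; train_dataset and model are the same objects passed to the Trainer below, and I'm assuming the label column is named "labels"):

import numpy as np

# Sanity check: every label id must lie in [0, num_labels).
labels = np.array(train_dataset["labels"])  # assumption: the column is named "labels"
print("label range:", labels.min(), "to", labels.max())
print("num_labels: ", model.config.num_labels)
assert labels.min() >= 0 and labels.max() < model.config.num_labels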

This is my training code:

from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback, IntervalStrategy
import numpy as np
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

def compute_metrics(eval_preds):

    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    precision = precision_metric.compute(predictions=predictions, references=labels, average="macro")["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels, average="macro")["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")["f1"]

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

training_args = TrainingArguments(
    output_dir='./zero_shot_topic_classification',
    evaluation_strategy=IntervalStrategy.STEPS,
    eval_steps=100,
    save_steps=200,
    logging_steps=100,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=50,
    weight_decay=0.01,
    save_strategy=IntervalStrategy.STEPS,
    push_to_hub=False,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    optim="adamw_torch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)

trainer.train()

This is with batch_size 8; the error appears at step 636.

I am going through the same problem. I tried setting the environment variable with os.environ['CUDA_LAUNCH_BLOCKING'] = "1", but it didn't help. I am fine-tuning a BERT model on my own dataset for intent classification.

RuntimeError                              Traceback (most recent call last)
<ipython-input-20-fa5ddd935c58> in <cell line: 4>()
      2 import os
      3 os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
----> 4 trainer.train()

4 frames
/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py in empty_cache()
    131     """
    132     if is_initialized():
--> 133         torch._C._cuda_emptyCache()
    134 
    135 

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
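
I later realized (this is my understanding of CUDA initialization, not anything transformers-specific) that CUDA_LAUNCH_BLOCKING is only read when CUDA is first initialized, so setting it after torch has already touched the GPU does nothing. The ordering that should take effect in Colab, after a runtime restart, is roughly:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # CUDA is initialized lazily, only after this import
from transformers import Trainer  # ...rest of the training code follows

Even then, the variable only makes kernel launches synchronous so the stack trace points at the failing op; it doesn't fix the underlying assert.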

Also, I tried running it on CPU on Colab, but there I was getting IndexError: Target 2 is out of bounds.
Basically I am stuck in a loop of three errors :tired_face:
Here's my code, could anyone help me with this?

from transformers import AutoTokenizer, AutoModelForSequenceClassification, BertForSequenceClassification
# cartesinus/xlm-r-base-amazon-massive-intent-label_smoothing
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity")

# Define training arguments
from transformers import Trainer, TrainingArguments
from torch import nn


training_args = TrainingArguments(
    per_device_train_batch_size=1,
    output_dir='./results',  # Directory where checkpoints and logs will be saved
    num_train_epochs=3,
    evaluation_strategy="steps",
    eval_steps=100,  # Number of steps before evaluating on the validation set
    save_steps=100,  # Number of steps before saving a checkpoint
    load_best_model_at_end=True,
    push_to_hub=False,  # Set to True if you want to push the model to the Hugging Face Model Hub
)


from transformers import EvalPrediction

def custom_accuracy(p: EvalPrediction):
    # Extract predictions and label_ids
    predictions = p.predictions.argmax(axis=1)
    label_ids = p.label_ids

    # Calculate accuracy
    correct = (predictions == label_ids).sum()
    total = len(predictions)
    accuracy = correct / total

    # Return accuracy as a dictionary
    return {"accuracy": accuracy}




# Define a metric for evaluation (e.g., accuracy); load_metric needs to be imported
from datasets import load_metric
metric = load_metric("accuracy")  # note: unused below -- the Trainer uses custom_accuracy instead

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=None,  # You can use your own data collator if needed
    compute_metrics=custom_accuracy
)

# Fine-tune the model on your dataset
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
trainer.train()

And here's what my dataset looks like, i.e. the training data:

Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 492
})

If you got the IndexError, you can try investigating the label indexing in your data. The out-of-bounds problem normally comes from that.
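
A quick check might look like this (a rough sketch; dataset and model stand in for your own objects, and I'm assuming the label column is named "label"):

# Compare the labels actually present in the data with what the model's head expects.
print(sorted(set(dataset["label"])))  # e.g. [0, 1, 2]
print(model.config.num_labels)        # must be strictly greater than the largest label id
print(model.config.id2label)          # the label mapping the checkpoint was saved with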

Yeah, but I think my labels are fine; they start from zero, as they should for a BERT model. Here, have a look at my dataset as a dataframe object.

I think the problem lies in how num_labels is assigned to the model from the model's config file.
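Concretely, something like this (a sketch of what I mean, assuming my three intent classes):

from transformers import AutoModelForSequenceClassification

# The yelp-polarity checkpoint ships with a 2-label head, so label id 2 is out of
# bounds for it. Reloading with num_labels=3 makes transformers discard the saved
# head and initialize a fresh 3-label one.
model = AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-yelp-polarity",
    num_labels=3,                  # assumption: three intent classes
    ignore_mismatched_sizes=True,  # replace the saved 2-label head
)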
And now I am getting a new error:

ValueError: Target size (torch.Size([1])) must be the same as input size (torch.Size([1, 2]))
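
Update: if I understand this second error correctly, transformers infers the problem type from the label dtype, and float labels push BertForSequenceClassification onto the BCEWithLogitsLoss path, which expects one target per logit. Casting the labels to integers should restore ordinary cross-entropy. A sketch, assuming the datasets.Dataset objects from above:

from datasets import Value

# Hypothetical fix: int64 labels make the model pick single-label
# classification (CrossEntropyLoss) instead of BCEWithLogitsLoss.
train_dataset = train_dataset.cast_column("label", Value("int64"))
validation_dataset = validation_dataset.cast_column("label", Value("int64"))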

I am getting the same error:

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Same case here: batch size 2, failing at step 32 while pretraining a LLaMA model.
Did anyone find a solution?

I was able to get past this error by setting PYTORCH_USE_CUDA_DSA to "1":

os.environ["PYTORCH_USE_CUDA_DSA"] = "1"

My torch version is 2.2.0.
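
Same caveat as with CUDA_LAUNCH_BLOCKING earlier in the thread, though: as far as I can tell the variable is only read when CUDA is initialized, so it has to be set before torch is imported, e.g. in the first cell after a runtime restart:

import os
os.environ["PYTORCH_USE_CUDA_DSA"] = "1"  # set before importing torch

import torch  # CUDA is initialized only after this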