Error running run_semantic_segmentation.py

Hi there, I had issues running the above script in Colab; the script is at the following link: transformers/run_semantic_segmentation.py at main · huggingface/transformers · GitHub

The training args I used are below:

!python /content/drive/MyDrive/run_semantic_segmentation.py \
    --model_name_or_path nvidia/mit-b5 \
    --dataset_name nickmuchi/rugd-dataset-all \
    --output_dir /content/drive/MyDrive/segformer-finetuned-rugd-out \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --evaluation_strategy steps \
    --push_to_hub \
    --push_to_hub_model_id segformer-finetuned-rugd \
    --max_steps 10000 \
    --learning_rate 0.00006 \
    --lr_scheduler_type polynomial \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --logging_strategy steps \
    --logging_steps 100 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 2 \
    --load_best_model_at_end True \
    --seed 1337 \
    --max_train_samples 3000 \
    --max_eval_samples 500

Error:

Traceback (most recent call last):
  File "/content/drive/MyDrive/run_semantic_segmentation.py", line 508, in <module>
    main()
  File "/content/drive/MyDrive/run_semantic_segmentation.py", line 483, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1324, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1559, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2206, in training_step
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
0% 0/10000 [00:00<?, ?it/s]

I tried googling but did not find much, thanks.

Hi @nielsr, could you please assist, as you recently did the demo? Thanks

Hi nickmuchi,

To see more detail on the error, put this code at the top of your script:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

There are many different problems that can raise a CUDA error.

Here are some cases I ran into before; I hope they help you.

1. Input dimension mismatch
My layer's shape was (512, 1), but I passed (512, 13), where 13 was the number of labels in my custom data.
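
In this script's case, the usual form of that mismatch is the classifier head not matching the dataset's labels. Here is a minimal sketch of sizing the head from an id2label mapping, with a made-up three-class mapping standing in for the real RUGD labels:

from transformers import AutoConfig, AutoModelForSemanticSegmentation

# Hypothetical label mapping -- replace with your dataset's actual classes.
id2label = {0: "background", 1: "grass", 2: "tree"}
label2id = {name: idx for idx, name in id2label.items()}

config = AutoConfig.from_pretrained(
    "nvidia/mit-b5",
    num_labels=len(id2label),  # sizes the decode head to the label count
    id2label=id2label,
    label2id=label2id,
)
model = AutoModelForSemanticSegmentation.from_pretrained("nvidia/mit-b5", config=config)

If the head and the segmentation maps disagree on the number of classes, the loss can index out of range on the GPU and surface as an opaque CUDA error like this one.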

2. Input data error
You need to check that the value range of your input tensors is correct.
My data contained values like [-1, 1, 1, 1] when the expected range was 0 to 75, and that raised a CUDA/cuBLAS error.
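
A quick way to verify that for segmentation labels (the num_labels value and the random tensor below are stand-ins so the snippet runs on its own):

import torch

num_labels = 25  # whatever the model head is configured with (illustrative value)
# `labels` would normally be one batch of segmentation maps from your dataset;
# a random tensor stands in here so the check is self-contained.
labels = torch.randint(0, 25, (2, 512, 512))
print("label range:", labels.min().item(), "to", labels.max().item())
# Every value should be in [0, num_labels), except the ignore index (usually 255).
valid = (labels < num_labels) | (labels == 255)
assert bool(valid.all()), "labels fall outside the expected range"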

3. Version conflict
I don't think this is the most likely cause, but I'll mention it just in case.
If you set up your environment with the official documentation's requirements, you can skip this.
If not, check your library versions: most conflicts come from CUDA, cuDNN, and the GPU driver, and the PyTorch, Transformers, and other library versions are also worth checking (a quick version dump is sketched below).
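
For that last check, a quick way to print the versions that matter in Colab (nothing here is specific to this script):

import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA (torch built with):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))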

Regards