Error running run_semantic_segmentation.py

Hi there, I had issues running the above script in Colab; the script is at the following link: transformers/run_semantic_segmentation.py at main · huggingface/transformers · GitHub

The training args I used are below:

!python /content/drive/MyDrive/run_semantic_segmentation.py \
    --model_name_or_path nvidia/mit-b5 \
    --dataset_name nickmuchi/rugd-dataset-all \
    --output_dir /content/drive/MyDrive/segformer-finetuned-rugd-out \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --evaluation_strategy steps \
    --push_to_hub \
    --push_to_hub_model_id segformer-finetuned-rugd \
    --max_steps 10000 \
    --learning_rate 0.00006 \
    --lr_scheduler_type polynomial \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --logging_strategy steps \
    --logging_steps 100 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 2 \
    --load_best_model_at_end True \
    --seed 1337 \
    --max_train_samples 3000 \
    --max_eval_samples 500

Error:

Traceback (most recent call last):
  File "/content/drive/MyDrive/run_semantic_segmentation.py", line 508, in <module>
    main()
  File "/content/drive/MyDrive/run_semantic_segmentation.py", line 483, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1324, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1559, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2206, in training_step
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
0% 0/10000 [00:00<?, ?it/s]

I tried googling but did not find much, thanks.

Hi @nielsr, could you please assist, as you recently did the demo? Thanks

Hi nickmuchi,

To see more detail on the error, put this code at the top of your script:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

There are many different problems that can raise a CUDA error.

Here are some cases I ran into before; I hope they help you.

1. Input dimension mismatch
My layer's shape was (512, 1), but I passed (512, 13), where 13 was the number of labels in my custom data.
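
In this script's case, the usual form of that mismatch is the classifier head not matching the dataset's labels. Here is a minimal sketch of sizing the head from an id2label mapping, with a made-up three-class mapping standing in for the real RUGD labels:

from transformers import AutoConfig, AutoModelForSemanticSegmentation

# Hypothetical label mapping -- replace with your dataset's actual classes.
id2label = {0: "background", 1: "grass", 2: "tree"}
label2id = {name: idx for idx, name in id2label.items()}

config = AutoConfig.from_pretrained(
    "nvidia/mit-b5",
    num_labels=len(id2label),  # sizes the decode head to the label count
    id2label=id2label,
    label2id=label2id,
)
model = AutoModelForSemanticSegmentation.from_pretrained("nvidia/mit-b5", config=config)

If the head and the segmentation maps disagree on the number of classes, the loss can index out of range on the GPU and surface as an opaque CUDA error like this one.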

2. Input data error
You need to check that the value range of your input tensors is correct.
My data contained values like [-1, 1, 1, 1] when the expected range was 0 to 75, and that raised a CUDA/cuBLAS error.
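
A quick way to verify that for segmentation labels (the num_labels value and the random tensor below are stand-ins so the snippet runs on its own):

import torch

num_labels = 25  # whatever the model head is configured with (illustrative value)
# `labels` would normally be one batch of segmentation maps from your dataset;
# a random tensor stands in here so the check is self-contained.
labels = torch.randint(0, 25, (2, 512, 512))
print("label range:", labels.min().item(), "to", labels.max().item())
# Every value should be in [0, num_labels), except the ignore index (usually 255).
valid = (labels < num_labels) | (labels == 255)
assert bool(valid.all()), "labels fall outside the expected range"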

3. Version conflict
I don't think this is the most likely cause, but I'll mention it just in case.
If you set up your environment with the official documentation's requirements, you can skip this.
If not, check your library versions: most conflicts come from CUDA, cuDNN, and the GPU driver, and the PyTorch, Transformers, and other library versions are also worth checking (a quick version dump is sketched below).
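
For that last check, a quick way to print the versions that matter in Colab (nothing here is specific to this script):

import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA (torch built with):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))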

Regards