CodeT5 fine-tuning fails with a CUDA error

I’m trying to reproduce the CodeT5 fine-tuning results (GitHub - salesforce/CodeT5: Code for CodeT5: a new code-aware pre-trained encoder-decoder model).
The command being used is:

python3 /home/ubuntu/CodeT5/run_gen.py \
--task summarize \
--sub_task python \
--summary_dir /home/ubuntu/CodeT5/summary \
--cache_path /home/ubuntu/CodeT5/cache \
--data_dir /home/ubuntu/CodeT5/data \
--res_dir /home/ubuntu/CodeT5/res \
--output_dir /home/ubuntu/CodeT5/output \
--save_last_checkpoints \
--always_save_model \
--do_eval_bleu \
--model_name_or_path='Salesforce/codet5-base-multi-sum' \
--tokenizer_name='Salesforce/codet5-base-multi-sum' \
--train_filename /home/ubuntu/CodeT5/data/summarize/python/train.jsonl \
--dev_filename /home/ubuntu/CodeT5/data/summarize/python/valid.jsonl \
--test_filename /home/ubuntu/CodeT5/data/summarize/python/test.jsonl \
--do_train \
--do_eval \
--do_test \
--save_steps=500 \
--log_steps=100 \
--local_rank=-1

Running it leads to the following error:

Traceback (most recent call last):
  File "/home/ubuntu/CodeT5/run_gen.py", line 387, in <module>
    main()
  File "/home/ubuntu/CodeT5/run_gen.py", line 234, in main
    outputs = model(input_ids=source_ids, attention_mask=source_mask,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1561, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 998, in forward
    layer_outputs = layer_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 639, in forward
    self_attention_outputs = self.layer[0](
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 546, in forward
    attention_output = self.SelfAttention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 483, in forward
    scores = torch.matmul(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
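As the message says, device-side asserts surface asynchronously, so the stack trace above may not point at the op that actually failed. A standard debugging trick (not CodeT5-specific) is to force synchronous kernel launches before anything touches CUDA:

```python
import os

# Must run before the first CUDA call (i.e. before the model is created):
# it makes every kernel launch synchronous, so the Python traceback stops
# at the op that actually triggered the device-side assert.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Equivalently, prefix the command with `CUDA_LAUNCH_BLOCKING=1`. Running on CPU serves a similar purpose and usually gives a clearer error, which is what the `--no_cuda` run below confirms.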

When run with an extra --no_cuda flag, the training script produces this error:

Traceback (most recent call last):
  File "/home/ubuntu/CodeT5/run_gen.py", line 394, in <module>
    main()
  File "/home/ubuntu/CodeT5/run_gen.py", line 241, in main
    outputs = model(input_ids=source_ids, attention_mask=source_mask,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1561, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 898, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2044, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

My guess is that something fishy is going on with source_ids, but I haven’t been able to figure out what.
A quick check shows that source_ids has shape torch.Size([8, 64]), and target_ids has torch.Size([8, 64]) as well.
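The CPU-side error is actually the more informative one: `F.embedding` raises `IndexError: index out of range in self` when some token id falls outside `[0, vocab_size)`, i.e. outside the embedding table. So the thing to check is the ids themselves, not the shapes. Here is a minimal, framework-free sketch of that check; the 32100 is what I believe the CodeT5 tokenizer's vocab size to be, but verify against `len(tokenizer)` / `model.config.vocab_size` in your own setup:

```python
def find_out_of_range_ids(token_ids, vocab_size):
    """Return the sorted token ids that would break an embedding lookup
    on a table with `vocab_size` rows."""
    return sorted({t for row in token_ids for t in row if t < 0 or t >= vocab_size})

# Toy batch: 32099 is the last valid id for a 32100-row table, 32100 is one past it.
batch = [[101, 2045, 32099], [5, 32100, 7]]
print(find_out_of_range_ids(batch, 32100))  # -> [32100]
```

If this turns up any ids, the usual suspects are a tokenizer/model vocab mismatch or ids being built manually instead of by the model's own tokenizer.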

I wonder:

  • whether (and how) this can be debugged and fixed?
  • whether the Hugging Face Trainer can be adapted to this fine-tuning task? The main things I expect the Trainer to provide are DeepSpeed support and wandb reporting integration

UPD: an example summarization script seems to work fine. Perhaps I’ll just have to tweak it a bit so that it can summarize code instead of natural language.

OK, to the best of my current understanding, it goes like this:
the example summarization script uses certain (I presume standard, from the original paper) values for max_source_length and max_target_length: 1024 and 128, respectively.
But CodeT5 uses 64 and 32, for reasons I was unable to discern.
Once those were changed, it still didn’t work, because it no longer fit into a single A100’s memory;
after also reducing the batch_size from 8 to 4, it now works.
Hooray!
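For the record, the overrides that made it work would look roughly like this on the CodeT5 side (assuming run_gen.py exposes these flags — check the argument parser in the repo’s configs.py):

```shell
python3 /home/ubuntu/CodeT5/run_gen.py \
  --max_source_length 1024 \
  --max_target_length 128 \
  --train_batch_size 4
  # ...plus the rest of the flags from the original command above
```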