Error fine-tuning distilled Pegasus with run_seq2seq.py

smaximo · February 16, 2021, 1:40pm

Hello,

This is my first post in the forum.

I have successfully fine-tuned t5-small and distilled bart models using run_seq2seq.py.

When I try to fine-tune sshleifer/distill-pegasus-xsum-16-8:

!python examples/seq2seq/run_seq2seq.py \
    --model_name_or_path $modelname \
    --do_train \
    --do_eval \
    --task summarization \
    --train_file $trainpath \
    --validation_file $valpath \
    --output_dir $modelsave \
    --overwrite_output_dir \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --predict_with_generate \
    --text_column ctext \
    --save_steps=100000 \
    --num_train_epochs=1 \
    --summary_column text

I get the following error:

5.27it/s]/opt/conda/conda-bld/pytorch_1603729138878/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [213,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1603729138878/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [213,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
..................
File "/kaggle/working/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 340, in forward
    if torch.isinf(hidden_states).any() or torch.isnan(hidden_states).any():
RuntimeError: CUDA error: device-side assert triggered
 14%|████▉                               | 3684/26890 [12:08<1:16:30,  5.05it/s]

I have tested with batch sizes of 1 and 2, and in both cases the error is triggered at 14% of the training steps.
The training dataset has 26890 examples and the validation dataset has 6720.
I tested the code in Google Colab and in Kaggle Kernels.

Has anyone successfully fine-tuned PEGASUS or a distilled version of PEGASUS using run_seq2seq.py? Which arguments did you use?

Thank you for your valuable time and help

nielsr · February 17, 2021, 3:55pm

Could you try to run the code on CPU? It usually helps to run the code on CPU when you have a CUDA error, to get a more informative error message.
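
One way to force a CPU run without editing the training script is to hide the GPUs from the process. The following is a minimal sketch, not something tested in this thread: the helper name `cpu_only_env` is hypothetical, while `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable.

```python
import os


def cpu_only_env(base=None):
    """Return a copy of the environment with all CUDA devices hidden.

    `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    # An empty value hides every GPU from CUDA, so torch.cuda.is_available()
    # should report False in a process started with this environment.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


# Small demonstration with a made-up base environment.
env = cpu_only_env({"PATH": "/usr/bin"})
print(sorted(env))
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the training script. In a notebook, setting `os.environ["CUDA_VISIBLE_DEVICES"] = ""` before the `!python ...` cell should have a similar effect, since the child process inherits the kernel's environment.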

smaximo · February 18, 2021, 10:41pm

I followed your suggestion and ran the code on CPU.

I got the following error:

…./trainer.py", line 1325, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  ".../modeling_pegasus.py", line 1260, in forward
    return_dict=return_dict,
…..
".../transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 722, in forward
    embed_pos = self.embed_positions(input_shape)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  "..../transformers/models/pegasus/modeling_pegasus.py", line 138, in forward
    return super().forward(positions)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Searching the forum, I found posts suggesting that this error could be related to max_position_embeddings.

The max_position_embeddings value for distill-pegasus-xsum-16-8 is 512.
My dataset contained one example longer than this value. After removing it, I was able to complete the fine-tuning successfully.
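
For anyone hitting the same error, the filtering step could look roughly like the sketch below. It is an illustration under stated assumptions, not the exact code used here: `drop_long_examples` and `whitespace_token_count` are hypothetical names, and the whitespace counter is only a stand-in for the model's real tokenizer, which will usually produce more tokens than there are words.

```python
from typing import Callable, Dict, List

MAX_POSITIONS = 512  # max_position_embeddings reported above for this model


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def drop_long_examples(
    rows: List[Dict[str, str]],
    text_column: str,
    max_positions: int = MAX_POSITIONS,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[Dict[str, str]]:
    """Keep only rows whose source text fits within the position embeddings."""
    return [row for row in rows if count_tokens(row[text_column]) <= max_positions]


# Toy data using the column names from the command in the first post.
rows = [
    {"ctext": "short article " * 10, "text": "summary"},  # 20 words
    {"ctext": "word " * 600, "text": "summary"},          # 600 words, too long
]
kept = drop_long_examples(rows, "ctext")
print(len(kept))

# With a Hugging Face tokenizer loaded, the counter could instead be something
# like: lambda t: len(tokenizer(t)["input_ids"])
```

If the version of run_seq2seq.py in use exposes a `--max_source_length` option, setting it to 512 so that long inputs are truncated might be an alternative to dropping examples, though that was not tried in this thread.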

Thanks very much for your help!
