Error fine-tuning distilled Pegasus with run_seq2seq.py

Hello,

This is my first post in the forum.

I have successfully fine-tuned t5-small and distilled BART models using run_seq2seq.py.

When I try to fine-tune sshleifer/distill-pegasus-xsum-16-8:

!python examples/seq2seq/run_seq2seq.py \
    --model_name_or_path $modelname \
    --do_train \
    --do_eval \
    --task summarization \
    --train_file $trainpath \
    --validation_file $valpath \
    --output_dir $modelsave \
    --overwrite_output_dir \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --predict_with_generate \
    --text_column ctext \
    --save_steps=100000 \
    --num_train_epochs=1 \
    --summary_column text

I get the following error:

/opt/conda/conda-bld/pytorch_1603729138878/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [213,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1603729138878/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [213,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
..................
File "/kaggle/working/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 340, in forward
    if torch.isinf(hidden_states).any() or torch.isnan(hidden_states).any():
RuntimeError: CUDA error: device-side assert triggered
 14%|████▉                               | 3684/26890 [12:08<1:16:30,  5.05it/s]

I have tested with batches of size 1 and 2, and in both cases, the error is triggered at 14% of training steps.
The training dataset has 26890 examples and the validation dataset has 6720.
I tested the code in Google Colab and in Kaggle Kernels.

Has anyone successfully fine-tuned PEGASUS or a distilled version of PEGASUS using run_seq2seq.py? Which arguments did you use?

Thank you for your valuable time and help

Could you try to run the code on CPU? It usually helps to run the code on CPU when you have a CUDA error, to get a more informative error message.
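
One way to do that without editing the script is to hide the GPUs from PyTorch through the CUDA_VISIBLE_DEVICES environment variable. Below is a minimal sketch, assuming you launch the training script as a subprocess from Python or a notebook. The helper name cpu_only_env is made up for illustration and is not part of transformers.

```python
import os


def cpu_only_env(base_env=None):
    """Return a copy of the environment in which no CUDA device is visible."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty CUDA_VISIBLE_DEVICES hides every GPU, so PyTorch falls back to CPU.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


# Small demo on a toy environment; the caller's dict is left unmodified.
demo_env = cpu_only_env({"PATH": "/usr/bin"})
print(demo_env)

# Hypothetical usage (fill in your own arguments):
# import subprocess
# subprocess.run(["python", "examples/seq2seq/run_seq2seq.py", ...], env=cpu_only_env())
```

The run will be much slower on CPU, but it only needs to reach the failing step to produce a readable Python traceback.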

I followed your suggestion and ran the code on CPU.

I got the following error:

…./trainer.py", line 1325, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  ".../modeling_pegasus.py", line 1260, in forward
    return_dict=return_dict,
…..
".../transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 722, in forward
    embed_pos = self.embed_positions(input_shape)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  "..../transformers/models/pegasus/modeling_pegasus.py", line 138, in forward
    return super().forward(positions)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Searching the forum, I found that this error might be related to max_position_embeddings.

The max_position_embeddings value for distill-pegasus-xsum-16-8 is 512.
My dataset contained one example longer than this limit. After removing it, I was able to complete the fine-tuning successfully.
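
In case it helps someone else, here is a rough sketch of how such examples can be located before training. It is not the exact code I ran. The function split_by_length and the whitespace tokenizer are stand-ins for illustration; for a real check you would pass the model's own tokenizer, and count any special tokens it appends, so that the lengths match what the model actually sees.

```python
def split_by_length(texts, tokenize, max_positions):
    """Split texts into those that fit in max_positions tokens and those that do not.

    Returns (kept, too_long), where too_long holds (index, token_count) pairs.
    """
    kept, too_long = [], []
    for i, text in enumerate(texts):
        n_tokens = len(tokenize(text))
        if n_tokens > max_positions:
            too_long.append((i, n_tokens))
        else:
            kept.append(text)
    return kept, too_long


def whitespace_tokenize(text):
    # Stand-in for the demo only; a real subword tokenizer usually yields more tokens.
    return text.split()


texts = ["a short article", "word " * 600, "another short one"]
kept, too_long = split_by_length(texts, whitespace_tokenize, max_positions=512)
print(len(kept), too_long)
```

The index in too_long tells you which row of the training file to inspect or drop. If I remember correctly, run_seq2seq.py also has a --max_source_length option that would truncate long inputs instead of dropping them, but please check that against your version of the script.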

Thanks very much for your help!
