Hello,
This is my first post in the forum.
I have successfully fine-tuned t5-small and distilled BART models using run_seq2seq.py.
However, when I try to fine-tune sshleifer/distill-pegasus-xsum-16-8 with the following command:
!python examples/seq2seq/run_seq2seq.py \
--model_name_or_path $modelname \
--do_train \
--do_eval \
--task summarization \
--train_file $trainpath \
--validation_file $valpath \
--output_dir $modelsave \
--overwrite_output_dir \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=1 \
--predict_with_generate \
--text_column ctext \
--save_steps=100000 \
--num_train_epochs=1 \
--summary_column text
I get the following error:
5.27it/s]/opt/conda/conda-bld/pytorch_1603729138878/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [213,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1603729138878/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [213,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
..................
File "/kaggle/working/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 340, in forward
if torch.isinf(hidden_states).any() or torch.isnan(hidden_states).any():
RuntimeError: CUDA error: device-side assert triggered
14%|████▉ | 3684/26890 [12:08<1:16:30, 5.05it/s]
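A note on reading this traceback: CUDA kernels run asynchronously, so the Python line reported (the `torch.isinf` check) may only be where the error surfaced, not where the failing kernel was launched. A minimal sketch for getting a more precise trace, assuming the training command is started from the same notebook session afterwards so that it inherits the environment:

```python
import os

# Force synchronous CUDA kernel launches so the traceback points at the
# operation that actually failed. This must be set before the training
# process starts; it slows training, so use it for debugging only.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# A command launched afterwards from this notebook (e.g. `!python ...`)
# should inherit the variable (assumption: same kernel session).
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Running the same step on CPU is another option, since CPU indexing errors usually name the offending index directly.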
I have tested with batch sizes of 1 and 2, and in both cases the error is triggered at about 14% of the training steps (step 3684 of 26890 in the run above, with batch size 1).
The training set has 26890 examples and the validation set has 6720.
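The `srcIndex < srcSelectDimSize` assertion comes from an index-select kernel, which typically means an index passed to an embedding lookup is out of range: either a token id not below the vocabulary size, or a sequence longer than the position-embedding table. Whether that is the cause here is an assumption. A minimal sketch of a check over tokenized examples follows; `find_out_of_range`, `vocab_size`, and `max_positions` are hypothetical names, and the toy data stands in for the real dataset:

```python
from typing import List, Sequence, Tuple


def find_out_of_range(
    examples: Sequence[Sequence[int]],
    vocab_size: int,
    max_positions: int,
) -> List[Tuple[int, str]]:
    """Return (example index, reason) for inputs that would index past an embedding table."""
    problems = []
    for i, ids in enumerate(examples):
        # A sequence longer than the position table has no embedding for its last positions.
        if len(ids) > max_positions:
            problems.append((i, f"length {len(ids)} > max_positions {max_positions}"))
        # A token id outside [0, vocab_size) has no row in the token embedding matrix.
        if any(t < 0 or t >= vocab_size for t in ids):
            problems.append((i, f"token id outside [0, {vocab_size})"))
    return problems


# Toy data only; these are not real token ids or real model limits.
toy_examples = [[1, 2, 3], [1, 2, 3, 4, 5], [1, 99]]
report = find_out_of_range(toy_examples, vocab_size=10, max_positions=4)
for index, reason in report:
    print(index, reason)
```

In a real run, the examples would be the tokenizer's `input_ids` for the `ctext` column, and the two limits would be read from the model config (`vocab_size` and `max_position_embeddings`). If long inputs turn out to be the problem, truncating the source text to the model's limit would be the thing to try.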
I tested the code in Google Colab and in Kaggle Kernels.
Has anyone successfully fine-tuned PEGASUS or a distilled version of PEGASUS using run_seq2seq.py? Which arguments did you use?
Thank you for your time and help.