Doubt on Tokenization in Pegasus

Hi, I created a 16-2 Pegasus student and then tried to use it on the XSUM dataset. The command I ran is:

python --max_source_length 500 --data_dir xsum --freeze_encoder --freeze_embeds --learning_rate=1e-4 --do_train --do_predict --val_check_interval 0.1 --n_val 1000 --max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 --model_name_or_path dpx_xsum_16_2 --train_batch_size=1 --eval_batch_size=1 --sortish_sampler --num_train_epochs=6 --warmup_steps 500 --output_dir distilpeg_xsum_sft_16_2 --gpus 0 --gradient_accumulation_steps 256 --adafactor --dropout 0.1 --attention_dropout 0.1 --overwrite_output_dir

My question is: is it normal that I get an error during embedding if I don't specify --max_source_length 500? And if I leave it at 500, will the fine-tuning still work well?
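To make the question concrete, here is a minimal sketch of how I imagine this kind of error arises. It assumes the model has a fixed-size position table; the limit of 512 and the helper names lookup_positions and truncate are made up for illustration and are not taken from the checkpoint or from the library. It uses plain Python lists rather than the real embedding layer.

```python
# Assumed number of positions the model's table covers (hypothetical value,
# not read from the actual dpx_xsum_16_2 config).
MAX_POSITIONS = 512


def lookup_positions(token_ids, max_positions=MAX_POSITIONS):
    """Mimic a fixed-size position table with one row per position index."""
    table = list(range(max_positions))  # stand-in for the embedding rows
    # Indexing past the last row raises IndexError, much like an
    # out-of-range lookup in a real embedding layer would fail.
    return [table[pos] for pos in range(len(token_ids))]


def truncate(token_ids, max_source_length):
    """Keep only the first max_source_length tokens of the input."""
    return token_ids[:max_source_length]


# Pretend this is a tokenized article that is longer than the table.
long_input = list(range(700))

try:
    lookup_positions(long_input)
    overflowed = False
except IndexError:
    overflowed = True

# With truncation to 500 tokens, every position fits in the table.
safe = lookup_positions(truncate(long_input, 500))
```

If this picture is right, any --max_source_length at or below the model's position limit should avoid the error, at the cost of dropping the tail of longer articles.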

Thanks in advance!

I noticed that the script in the repo sets --max_source_length 512, so I ran with that setting. But I see that the starting ROUGE-2 (r2) score is 0.0 in metrics.json. Is this a problem?