run_summarization.py T5 model outputs inconsistent results

I am fine-tuning a T5 model, and I found the t5-small-finetuned-samsum-en model on Hugging Face, together with its ROUGE scores on the SAMSum Benchmark (Summarization) page at Papers With Code.
It shows the following (I have removed all models except t5-small-finetuned-samsum-en):

| Rank | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-LSUM | gen_len | loss | Year |
|------|------------------------------|---------|---------|---------|------------|---------|-------|------|
| 12   | t5-small-finetuned-samsum-en | 40.039  | 15.85   | 31.808  | 36.089     | 18.107  | 2.192 | 2022 |
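For a quick qualitative check, the downloaded checkpoint can also be tried by hand with a summarization pipeline. This is only a minimal sketch; the local directory name is an assumption for wherever the Hub files were saved:

```python
# Quick manual check of the downloaded checkpoint (the path is hypothetical:
# point it at the directory the Hub files were downloaded into).
from transformers import pipeline

summarizer = pipeline("summarization", model="./t5-small-finetuned-samsum-en")
dialogue = (
    "Amanda: I baked cookies. Do you want some?\n"
    "Jerry: Sure!\n"
    "Amanda: I'll bring you tomorrow :-)"
)
print(summarizer("summarize: " + dialogue, max_length=64)[0]["summary_text"])
```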

But when I evaluate the model I downloaded from Hugging Face (t5-small-finetuned-samsum-en) with the Transformers example script run_summarization.py, the results are not consistent with the numbers above.
My ROUGE-1 is 3.34 while the benchmark reports 40.039, which is a huge difference. Here are my results (the values are on a 0-1 scale, so multiply by 100 to compare):
***** eval metrics *****
eval_gen_len = 12.2066
eval_loss = 7.7219
eval_rouge1_high_fmeasure = 0.0383
eval_rouge1_high_precision = 0.2029
eval_rouge1_high_recall = 0.0225
eval_rouge1_low_fmeasure = 0.0334
eval_rouge1_low_precision = 0.1792
eval_rouge1_low_recall = 0.0194
eval_rouge1_mid_fmeasure = 0.0358
eval_rouge1_mid_precision = 0.1909
eval_rouge1_mid_recall = 0.0209
eval_rouge2_high_fmeasure = 0.0021
eval_rouge2_high_precision = 0.0114
eval_rouge2_high_recall = 0.0012
eval_rouge2_low_fmeasure = 0.0011
eval_rouge2_low_precision = 0.006
eval_rouge2_low_recall = 0.0006
eval_rouge2_mid_fmeasure = 0.0016
eval_rouge2_mid_precision = 0.0086
eval_rouge2_mid_recall = 0.0009
eval_rougeL_high_fmeasure = 0.0334
eval_rougeL_high_precision = 0.1803
eval_rougeL_high_recall = 0.0197
eval_rougeL_low_fmeasure = 0.0293
eval_rougeL_low_precision = 0.1587
eval_rougeL_low_recall = 0.0171
eval_rougeL_mid_fmeasure = 0.0314
eval_rougeL_mid_precision = 0.1688
eval_rougeL_mid_recall = 0.0184
eval_rougeLsum_high_fmeasure = 0.0362
eval_rougeLsum_high_precision = 0.1951
eval_rougeLsum_high_recall = 0.0213
eval_rougeLsum_low_fmeasure = 0.0317
eval_rougeLsum_low_precision = 0.1708
eval_rougeLsum_low_recall = 0.0184
eval_rougeLsum_mid_fmeasure = 0.034
eval_rougeLsum_mid_precision = 0.1832
eval_rougeLsum_mid_recall = 0.0199
eval_runtime = 0:00:54.57
eval_samples = 818
eval_samples_per_second = 14.988
eval_steps_per_second = 14.988
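As I understand it, the low/mid/high suffixes above are the bootstrap confidence bounds produced by the rouge_score aggregator. A minimal sketch of what the metric returns, assuming the `datasets` ROUGE metric (a wrapper around Google's rouge_score library) that the example script uses:

```python
# Sketch of where the low/mid/high metric names come from, assuming the
# `datasets` ROUGE metric wrapping Google's rouge_score library.
from datasets import load_metric

metric = load_metric("rouge")
result = metric.compute(
    predictions=["the dog ran through the park"],
    references=["a dog runs in the park"],
)
# Each entry is an AggregateScore(low=..., mid=..., high=...) holding
# bootstrap confidence bounds; each bound is a Score with
# precision/recall/fmeasure on a 0-1 scale.
print(result["rouge1"].mid.fmeasure * 100)  # comparable to the leaderboard numbers
```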
And I have not changed the run_summarization.py script. My shell command is:

python ./run_summarization.py \
  --model_name_or_path $T5_DIR \
  --do_eval \
  --source_prefix "summarize: " \
  --output_dir ./output/huggingface-summarization \
  --dataset_name $samsum_path \
  --dataset_config "3.0.0" \
  --per_device_train_batch_size=1 \
  --per_device_eval_batch_size=1 \
  --overwrite_output_dir \
  --predict_with_generate
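
To rule out something specific to the example script, here is a minimal standalone cross-check I would expect to give numbers near the leaderboard's. It is only a sketch: it assumes $T5_DIR and $samsum_path resolve the same way as in the command above, and that the dataset exposes `dialogue` and `summary` columns.

```python
# Standalone ROUGE cross-check, independent of run_summarization.py.
# Assumes T5_DIR and samsum_path are set as in the shell command above,
# and that the dataset has "dialogue" and "summary" fields.
import os
from datasets import load_dataset, load_metric
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = os.environ["T5_DIR"]
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

dataset = load_dataset(os.environ["samsum_path"], split="validation")
metric = load_metric("rouge")

predictions, references = [], []
for example in dataset.select(range(50)):  # small sample for a quick check
    inputs = tokenizer(
        "summarize: " + example["dialogue"],
        return_tensors="pt", truncation=True, max_length=512,
    )
    output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    predictions.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    references.append(example["summary"])

result = metric.compute(predictions=predictions, references=references)
print({name: round(score.mid.fmeasure * 100, 2) for name, score in result.items()})
```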

Can anyone help me with this issue? Thanks a lot.