Here’s what I think:

- The default task for `finetune.py` is summarization, and it uses the `generate` parameters for summarization tasks, which are not useful here.
- An `eval_max_gen_length` of 142 seems too large for a QA task; it should be lower IMO.
- Using `beam_search` might not give good results for QA; in the T5 paper they used greedy decoding for QA.

When calling `generate`, it could be picking up the summarization `generate` parameters, which could explain the longer answers.

Try greedy decoding with `generate`: set `num_beams` to 1, use a smaller `max_length` (32 should be enough; for SQuAD, 16 is fine), and set `length_penalty` to 0.
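As a minimal sketch of what that looks like outside the `finetune.py` script (model name and question/context input here are just illustrative, not from the original issue):

```python
# Sketch: greedy decoding for a T5-style QA model with `transformers`.
# "t5-small" and the example input are placeholders for illustration.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

text = "question: What is the capital of France? context: Paris is the capital of France."
inputs = tokenizer(text, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    num_beams=1,         # greedy decoding, as in the T5 paper for QA
    max_length=32,       # short answers; 16 is typically enough for SQuAD
    length_penalty=0.0,  # disable the summarization-style length bonus
)
answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(answer)
```

The same settings can be passed through the script's eval flags instead of a manual `generate` call; the key point is overriding the summarization defaults.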
LMK if this helps