Things I’ve found
- Task prefixes matter when:
  1. doing multi-task training
  2. your task is similar or related to one of the supervised tasks used in the T5 pre-training mixture.
- T5 needs a slightly higher LR than the default one set in `Trainer`; in my experiments 1e-4 and 3e-4 worked for almost all problems (classification, QA, question generation, summarization).
- No need to pass `decoder_input_ids` to T5 yourself, just pass `labels` and the model will prepare them for you. `labels` should end with `eos_token`. (Important! This is where most of the mistakes happen; see the first sketch after this list.)
- T5 uses `pad_token` as the `decoder_start_token_id`, so when doing generation without the `generate` function make sure you start it with the pad token (see the second sketch after this list).
- Trimming batches when training on TPU leads to very slow training.
- Apparently, because of SentencePiece and some possible leakage of other languages in the C4 data, T5 gives somewhat sensible results for French. I fine-tuned it on FQuAD (the French version of SQuAD) for question generation and BLEU-4 against the dev set was 15.
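A minimal fine-tuning sketch for the prefix / `labels` / LR points above, assuming `t5-small`, a toy question-generation pair, and AdamW at 3e-4; the model name, example text, and hyperparameters here are my own illustrations, not from the experiments above:

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# task prefix in front of the input, same idea as the T5 pre-training mixture
input_text = "generate question: The Eiffel Tower was completed in 1889."
target_text = "When was the Eiffel Tower completed?"

inputs = tokenizer(input_text, return_tensors="pt")

# labels must end with eos_token (</s>); recent tokenizer versions append it
# automatically when add_special_tokens=True, but it is worth checking
labels = tokenizer(target_text, return_tensors="pt").input_ids
assert labels[0, -1].item() == tokenizer.eos_token_id
# (when batching with padding, also set padded label positions to -100
#  so they are ignored by the loss)

# pass only `labels`; the model builds decoder_input_ids by shifting them right
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
loss = outputs.loss

# slightly higher LR than the Trainer default, per the note above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss.backward()
optimizer.step()
```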
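And a minimal greedy-decoding sketch for the pad-token point, again assuming `t5-small` and an illustrative translation prompt; this is just one way to run the decoder step by step without `generate`:

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

enc = tokenizer("translate English to German: Hello, how are you?",
                return_tensors="pt")

# T5 uses pad_token as decoder_start_token_id, so the decoder input starts with it
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    for _ in range(32):  # max new tokens
        logits = model(input_ids=enc.input_ids,
                       attention_mask=enc.attention_mask,
                       decoder_input_ids=decoder_input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```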
Not sure if it's an issue or not, but in some cases using `label_smoothing` in T5 resulted in `nan` loss.
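For reference, this is a sketch of how one might enable label smoothing with today's `Seq2SeqTrainingArguments`; it may differ from the setup where I saw the `nan` loss, and the values are only examples:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="t5-finetune",
    learning_rate=3e-4,           # per the LR note above
    label_smoothing_factor=0.1,   # smoothing value shown only as an example
    predict_with_generate=True,
)
# if the loss turns into nan with smoothing enabled, try setting it back to 0.0
```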