Hi Patrick!
Since we now have many great new pretrained T5 models (not yet counting mT5), I'd like to try to summarize the meaning of their suffixes, to make sure we understand them correctly.
v1_1 or xl or xxl – use a slightly modified architecture compared to the original T5, detailed here and here. Due to these minor changes the parameter counts shift a bit, so the models are named xl and xxl rather than 3B and 11B (not sure if they are bigger or smaller than before).
They were also pretrained only on C4 (i.e. not pretrained on multi-task supervised datasets like the original T5).
ssm – uses salient span masking, detailed in Section 3 of the paper. This special masking significantly improves the model's world knowledge.
tqa – finetuned on the TriviaQA dataset, using 100% of the training data
tqao – like above, but using only 90% of the training data
wq – finetuned on the WebQuestions dataset, using 100% of the training data
wqo – like above, but using only 90% of the training data
nq – finetuned on Google's Natural Questions dataset, using 100% of the training data
nqo – like above, but using only 90% of the training data
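To keep the naming scheme straight, here is a small sketch of how a checkpoint name decomposes into the pieces above. The helper `parse_t5_name` is purely illustrative (not part of `transformers` or any official tooling), just encoding the suffix table from this post:

```python
# Hypothetical helper: split a T5 checkpoint name like "t5-xxl-ssm-nq"
# into the components described above. Illustrative only.

SIZES = {"small", "base", "large", "xl", "xxl", "3b", "11b"}

# suffix -> (finetuning dataset, fraction of training data used)
DATASETS = {
    "tqa": ("TriviaQA", 1.0),
    "tqao": ("TriviaQA", 0.9),
    "wq": ("WebQuestions", 1.0),
    "wqo": ("WebQuestions", 0.9),
    "nq": ("Natural Questions", 1.0),
    "nqo": ("Natural Questions", 0.9),
}

def parse_t5_name(name: str) -> dict:
    """Parse a checkpoint name such as 't5-xxl-ssm-nq'."""
    info = {"size": None, "ssm": False, "finetuned_on": None, "train_fraction": None}
    for part in name.lower().split("-"):
        if part in SIZES:
            info["size"] = part
        elif part == "ssm":
            info["ssm"] = True  # pretrained with salient span masking
        elif part in DATASETS:
            info["finetuned_on"], info["train_fraction"] = DATASETS[part]
    return info
```

For example, `parse_t5_name("t5-xxl-ssm-nq")` would report an xxl model, SSM-pretrained, finetuned on Natural Questions with 100% of the training data.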
Also, I want to note that although the official metric performance of these SSM-pretrained models looks inferior to open-book models like DPR, the authors observe in the paper, based on manual evaluation, that around 30% of the "officially wrong" answers are actually false negatives: since T5 generates answers freely, they may not match the gold answer exactly even though they are in fact correct.
For example, on the closed-book NQ task, taking these false negatives into account, T5-XXL-SSM is estimated to score 0.57, compared to its official score of 0.37 and DPR's SOTA of 0.42.
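As a back-of-envelope sanity check (my own arithmetic, not a calculation from the paper): if ~30% of the officially wrong answers are credited back as correct, the adjusted score lands close to the quoted 0.57:

```python
official = 0.37   # official closed-book NQ score of T5-XXL-SSM
fn_rate = 0.30    # estimated fraction of "wrong" answers that are false negatives

# Credit back 30% of the officially-wrong mass (1 - official)
adjusted = official + fn_rate * (1 - official)
print(round(adjusted, 2))  # ~0.56, in line with the ~0.57 estimate
```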