Best practices for estimating FLOPs-per-token with real datasets?

Hi folks,

I’m currently reading the T-Few paper on few-shot learning, and in Section 4.2 the authors provide a table and an estimate of the 11B-parameter model’s inference costs:

[Screenshot of Table 1 (cost estimates) from the paper]

We summarize the costs in table 1 and discuss them below. For all estimates, we use the median number of shots (41) across the datasets we consider. Rank evaluation and our unlikelihood loss both require processing every possible output choice to attain a prediction for an unlabeled example. The median combined tokenized sequence length for the input and all possible targets is 103 for the datasets we consider. Processing a single input and all target choices with T-Few requires 11e9 × 103 = 1.1e12 FLOPs, whereas few-shot ICL with GPT-3 175B requires 2 × 175e9 × (41 × 98 + 103) = 1.4e15 FLOPs – more than 3 orders of magnitude more.
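
For anyone else checking the arithmetic, here is the quoted estimate reproduced as a small Python snippet. The variable names are mine, and I’m reading the 41 × 98 term as 41 shots at a median of 98 tokens per shot:

```python
# Reproduce the quoted cost estimates (all numbers taken from the paper's text).
tfew_params = 11e9    # T0-11B parameter count
gpt3_params = 175e9   # GPT-3 parameter count
shots = 41            # median number of shots across the datasets
shot_len = 98         # presumably the median tokenized length of one shot
example_len = 103     # median combined length of input + all target choices

# The quote uses params * tokens for T-Few (an encoder-decoder model) and
# 2 * params * tokens for decoder-only GPT-3 in-context learning.
tfew_flops = tfew_params * example_len                           # ~1.1e12
gpt3_flops = 2 * gpt3_params * (shots * shot_len + example_len)  # ~1.4e15

print(f"T-Few:     {tfew_flops:.2e} FLOPs")
print(f"GPT-3 ICL: {gpt3_flops:.2e} FLOPs")
print(f"ratio:     {gpt3_flops / tfew_flops:,.0f}x")
```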

My question is: why is the median input sequence length used for the FLOPs estimate instead of the mean?

I understand that a dataset can have outliers in length, but I’m curious whether using the median is common practice.
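
For context, here is roughly how I’m computing the length statistics on my own data. The tokenizer and example strings below are placeholders, not what the paper used:

```python
import statistics
from transformers import AutoTokenizer

# Placeholder tokenizer -- swap in whatever your model actually uses.
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Each entry is the input concatenated with all possible target choices,
# mirroring the "combined tokenized sequence length" the paper measures.
examples = [
    "The movie was surprisingly good. Choices: positive / negative",
    "I would not recommend this restaurant. Choices: positive / negative",
    # ... the rest of the dataset
]

lengths = [len(tokenizer(text)["input_ids"]) for text in examples]

print("mean length:  ", statistics.mean(lengths))
print("median length:", statistics.median(lengths))

# Per-example FLOPs estimate using the paper's params-times-tokens convention.
params = 11e9
print("FLOPs per example (median):", params * statistics.median(lengths))
```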

Thanks!

Answer from Colin Raffel (shared internally):

Yeah, the mean can be a bit weird for sequence length since it’s a heavy-tailed distribution with lots of outliers (not normally distributed). I think in this case the median and mean were similar and we just used the median since it’s an int.
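
A quick synthetic illustration of that point (log-normal lengths are just a stand-in for the kind of right-skewed distribution real datasets show):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated tokenized sequence lengths with a heavy right tail.
lengths = rng.lognormal(mean=4.5, sigma=0.8, size=10_000).astype(int)

print("mean:  ", lengths.mean())       # pulled upward by the long tail
print("median:", np.median(lengths))   # closer to a "typical" example

# A FLOPs estimate scales linearly with whichever statistic you pick,
# so a tail-inflated mean inflates the cost estimate by the same factor.
params = 11e9
print("FLOPs (mean):  ", params * lengths.mean())
print("FLOPs (median):", params * np.median(lengths))
```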