Hi folks,

I’m currently reading the T-Few paper on few-shot learning, and in Section 4.2 they provide a table and an estimate of the 11B-parameter model’s inference cost as follows:

> We summarize the costs in table 1 and discuss them below. For all estimates, we use the median number of shots (41) across the datasets we consider. Rank evaluation and our unlikelihood loss both require processing every possible output choice to attain a prediction for an unlabeled example.
>
> The median combined tokenized sequence length for the input and all possible targets is 103 for the datasets we consider. … Processing a single input and all target choices with T-Few requires 11e9 × 103 = 1.1e12 FLOPs, whereas few-shot ICL with GPT-3 175B requires 2 × 175e9 × (41 × 98 + 103) = 1.4e15 FLOPs – more than 3 orders of magnitude more.
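For reference, the arithmetic in the quote can be reproduced directly (the constants — 41 shots, 98 tokens per in-context example, 103 tokens for the input plus all targets — are taken from the quoted passage):

```python
t_few_params = 11e9       # T-Few (T0-11B) parameter count
gpt3_params = 175e9       # GPT-3 175B parameter count
seq_len = 103             # median input + all-target-choices length (tokens)
shots, shot_len = 41, 98  # median number of shots and per-shot length

# T-Few: the quote uses ~1 FLOP per parameter per token (no factor of 2)
t_few_flops = t_few_params * seq_len

# GPT-3 ICL: ~2 FLOPs per parameter per token over the full context
gpt3_flops = 2 * gpt3_params * (shots * shot_len + seq_len)

print(f"T-Few: {t_few_flops:.1e} FLOPs")  # ~1.1e12, matching the quote
print(f"GPT-3: {gpt3_flops:.1e} FLOPs")   # ~1.4e15, matching the quote
print(f"ratio: {gpt3_flops / t_few_flops:.0f}x")
```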

My question is: why is the **median** input sequence length used for the FLOPs estimate instead of the **mean**?

I understand that a dataset can have outliers in sequence length, which would make the median more robust, but I’m curious whether using the median for cost estimates like this is common practice.

Thanks!