Best practices for estimating FLOPs-per-token with real datasets?

lewtun · September 20, 2022, 12:40pm

Hi folks,

I’m currently reading the T-Few paper on few-shot learning and in section 4.2 they provide a table and estimate of the 11B parameter model’s inference costs as follows:

We summarize the costs in table 1 and discuss them below. For all estimates, we use the median number of shots (41) across the datasets we consider. Rank evaluation and our unlikelihood loss both require processing every possible output choice to attain a prediction for an unlabeled example. The median combined tokenized sequence length for the input and all possible targets is 103 for the datasets we consider. … Processing a single input and all target choices with T-Few requires 11e9×103 = 1.1e12 FLOPs, whereas few-shot ICL with GPT-3 175B requires 2×175e9×(41 × 98 + 103) = 1.4e15 FLOPs – more than 3 orders of magnitude more.
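
As a sanity check on the quoted numbers, here is a minimal Python sketch that reproduces the arithmetic. All constants come straight from the passage; the function and variable names are my own (hypothetical). The 11e9 FLOPs-per-token figure for T-Few is taken as given from the quote rather than derived, and reading 98 as a per-shot token count is my interpretation of the expression.

```python
import math

# Constants copied from the quoted passage.
T_FEW_FLOPS_PER_TOKEN = 11e9  # per-token cost used in the quote (taken as given)
GPT3_PARAMS = 175e9           # GPT-3 175B parameter count
NUM_SHOTS = 41                # median number of shots
TOKENS_PER_SHOT = 98          # assumed meaning of the "98" in the quote
QUERY_TOKENS = 103            # median length of input plus all target choices


def t_few_flops(tokens: int) -> float:
    """FLOPs to process `tokens` tokens at the quoted T-Few per-token cost."""
    return T_FEW_FLOPS_PER_TOKEN * tokens


def icl_flops(params: float, shots: int, tokens_per_shot: int, query_tokens: int) -> float:
    """FLOPs for few-shot ICL: 2 * params per token, over all shots plus the query."""
    total_tokens = shots * tokens_per_shot + query_tokens
    return 2 * params * total_tokens


t_few = t_few_flops(QUERY_TOKENS)
gpt3 = icl_flops(GPT3_PARAMS, NUM_SHOTS, TOKENS_PER_SHOT, QUERY_TOKENS)
ratio = gpt3 / t_few

print(f"T-Few:      {t_few:.1e} FLOPs")  # 1.1e+12
print(f"GPT-3 ICL:  {gpt3:.1e} FLOPs")   # 1.4e+15
print(f"log10(ratio) = {math.log10(ratio):.2f}")
```

The ratio comes out a little above 1000, which matches the "more than 3 orders of magnitude" claim.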

My question is: why is the median input sequence length used for the FLOPs estimate instead of the mean?

I understand that a dataset can have outliers in length, but I’m curious whether using the median is common practice.

Thanks!

lewtun · September 20, 2022, 1:44pm

From Colin Raffel internally:

Yeah, the mean can be a bit weird for sequence length since it’s a heavy-tailed distribution with lots of outliers (not normally distributed). I think in this case the median and mean were similar and we just used the median since it’s an int.
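
To make the heavy-tail point concrete, here is a small Python sketch. The sequence lengths are made up for illustration (they are not from the T-Few datasets), and the per-token cost simply reuses the 11e9 figure from the quote above.

```python
import statistics

# Hypothetical tokenized sequence lengths with a single long outlier.
# These values are invented for illustration only.
lengths = [80, 90, 95, 100, 103, 105, 110, 120, 900]

FLOPS_PER_TOKEN = 11e9  # per-token cost from the quoted T-Few estimate

median_len = statistics.median(lengths)
mean_len = statistics.mean(lengths)

# Per-example FLOPs estimate under each summary statistic.
flops_median = FLOPS_PER_TOKEN * median_len
flops_mean = FLOPS_PER_TOKEN * mean_len

# Exact total cost over the whole (toy) dataset, assuming cost is linear in tokens.
flops_total = FLOPS_PER_TOKEN * sum(lengths)

print(f"median length = {median_len}, mean length = {mean_len:.1f}")
print(f"per-example FLOPs (median): {flops_median:.2e}")
print(f"per-example FLOPs (mean):   {flops_mean:.2e}")
print(f"total FLOPs over dataset:   {flops_total:.2e}")
```

In this toy sample one outlier nearly doubles the mean while the median stays put, so the median better describes a typical example. Under a cost model that is linear in tokens, though, the total cost over a dataset is the mean-based estimate times the number of examples, so which statistic to report depends on whether you care about a typical example or the aggregate bill.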
