Understanding FLOPs-per-token estimates from OpenAI's scaling laws

Sharing the answer internally from Thomas Wang:

How exactly is the equation for C_forward derived? Is it the sum of all rows in the table or something else?

Yes to the latter question: it is the sum of the rows in the table.
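
For reference, here is that sum written out from the per-token FLOPs column of the table (Embed and De-embed rows excluded, as covered below); the variable names follow the paper:

```latex
% Summing the per-token FLOPs rows (QKV, Mask, Project, Feed-Forward):
C_{\text{forward}}
  \approx 2\, n_{\text{layer}} d_{\text{model}} (3 d_{\text{attn}})   % QKV
        + 2\, n_{\text{layer}} n_{\text{ctx}} d_{\text{attn}}          % attention mask
        + 2\, n_{\text{layer}} d_{\text{attn}} d_{\text{model}}        % projection
        + 2\, n_{\text{layer}} (2 d_{\text{model}} d_{\text{ff}})      % feed-forward
  = 2N + 2\, n_{\text{layer}} n_{\text{ctx}} d_{\text{attn}},
  \qquad N = 2\, d_{\text{model}} n_{\text{layer}} (2 d_{\text{attn}} + d_{\text{ff}}).
```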

In particular, how is d_embd converted into one of the other known variables that make up N?

d_embd == d_model
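
So wherever d_embd appears, it can be read as d_model; with that substitution, the non-embedding parameter count from the table reduces to the familiar form (assuming the standard choices d_attn = d_model and d_ff = 4 d_model):

```latex
N \approx 2\, d_{\text{model}} n_{\text{layer}} (2 d_{\text{attn}} + d_{\text{ff}})
  \;\approx\; 12\, n_{\text{layer}} d_{\text{model}}^{2}
  \quad (d_{\text{attn}} = d_{\text{model}},\; d_{\text{ff}} = 4 d_{\text{model}}).
```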

Similarly, is the “De-embed” estimate 2 * d_model * n_vocab excluded from the calculation (I know the Embed one is)?

Yes (sorry, it’s a bit hard to write math here, but essentially for QKV/Project/FF, if the parameter count is P, then the FLOPs per token is 2P). Consequently, if you add everything up, you end up with N parameters and 2N FLOPs per token (and then you add the masking term).
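
A minimal sketch of that bookkeeping in Python (my own illustration, not from the thread): it sums the QKV/Project/FF parameters to get N, applies the 2 FLOPs per parameter rule, and adds the masking term. The hyperparameter names (n_layer, d_model, d_attn, d_ff, n_ctx) follow the paper's table; the d_attn = d_model and d_ff = 4*d_model defaults are the usual Transformer choices, not anything stated above.

```python
def params_and_flops_per_token(n_layer, d_model, n_ctx, d_attn=None, d_ff=None):
    """Non-embedding parameter count N and forward-pass FLOPs per token,
    summing the per-row estimates discussed above (Embed/De-embed excluded)."""
    d_attn = d_attn if d_attn is not None else d_model  # usual choice: d_attn == d_model
    d_ff = d_ff if d_ff is not None else 4 * d_model    # usual choice: d_ff == 4 * d_model

    qkv_params = n_layer * d_model * 3 * d_attn   # attention QKV projections
    proj_params = n_layer * d_attn * d_model      # attention output projection
    ff_params = n_layer * 2 * d_model * d_ff      # two feed-forward matrices
    n_params = qkv_params + proj_params + ff_params  # N (non-embedding)

    # Each weight parameter contributes 2 FLOPs per token (multiply + add),
    # so the weight terms give 2 * N; masking adds 2 * n_layer * n_ctx * d_attn.
    c_forward = 2 * n_params + 2 * n_layer * n_ctx * d_attn
    return n_params, c_forward


# Example: GPT-2-small-like config (hypothetical numbers, for illustration only)
N, C_fwd = params_and_flops_per_token(n_layer=12, d_model=768, n_ctx=1024)
print(f"N ≈ {N:,} params, C_forward ≈ {C_fwd:,} FLOPs per token")
```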