Understanding FLOPs-per-token estimates from OpenAI's scaling laws

Hi folks,

I’m trying to compare FLOPs-per-token for various Transformer architectures and came across the estimation formulas provided in OpenAI’s scaling-laws paper.

In a nutshell, they estimate that the forward pass of a decoder-only Transformer costs C_\mathrm{forward} \approx 2N FLOPs per token (one multiply and one add per weight), where N is the number of non-embedding parameters in the model.

For an input sequence of length S, this nifty result lets one estimate the forward-pass cost of a decoder-only model at \approx 2NS FLOPs for the whole sequence, i.e. \approx 2N FLOPs per token.
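For example, a hypothetical model with N = 1.3 \times 10^9 non-embedding parameters processing S = 2048 tokens would need roughly 2 \times 1.3 \times 10^9 \times 2048 \approx 5.3 \times 10^{12} FLOPs for the forward pass (numbers picked purely for illustration).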

The estimate for the number of add-multiply operations comes from Table 1 of their paper.
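If I’m reading it correctly, the totals row of that table gives N = 2 d_\mathrm{model} n_\mathrm{layer} (2 d_\mathrm{attn} + d_\mathrm{ff}) non-embedding parameters and C_\mathrm{forward} = 2N + 2 n_\mathrm{layer} n_\mathrm{ctx} d_\mathrm{attn} FLOPs per token.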

My question is:

How exactly is the equation for C_\mathrm{forward} derived? Is it the sum of all rows in the table or something else?

In particular, how is d_\mathrm{embd} converted into one of the other known variables that make up N? Similarly, is the “De-embed” estimate 2 d_\mathrm{model} n_\mathrm{vocab} excluded from the calculation (I know the Embed one is)?

Thanks!

Sharing the answer I received internally from Thomas Wang:

> How exactly is the equation for C_forward derived? Is it the sum of all rows in the table or something else?

Yes to the latter question: it’s essentially the sum of the rows in the table (with the exclusions noted below).

> In particular, how is d_embd converted into one of the other known variables that make up N?

d_embd == d_model

> Similarly, is the “De-embed” estimate 2 * d_model * n_vocab excluded from the calculation (I know the Embed one is)?

Yes (sorry, it’s a bit hard to write math here, but essentially for QKV/Project/FF, if the parameter count is P, then the FLOPs per token are 2P). Consequently, if you add everything up, you end up with N parameters and 2N FLOPs per token (and then you add the masking term).
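
To sanity-check this accounting, here is a minimal Python sketch that plugs the Table 1 formulas into a hypothetical GPT-2-small-sized configuration (all hyperparameter values below are my own illustrative assumptions, not numbers from the paper):

```python
# Minimal sketch: plug the Table 1 formulas into a hypothetical
# GPT-2-small-sized config (all hyperparameter values are illustrative
# assumptions on my part, not numbers from the paper).
n_layer = 12
d_model = 768
d_attn = 768          # assuming d_attn == d_model
d_ff = 4 * d_model    # assuming the usual 4x feed-forward width
n_ctx = 1024
n_vocab = 50257       # unused below, since embed/de-embed are excluded from N

# Non-embedding parameters per component (Table 1, "Parameters" column)
params_qkv = n_layer * d_model * 3 * d_attn
params_proj = n_layer * d_attn * d_model
params_ff = n_layer * 2 * d_model * d_ff
N = params_qkv + params_proj + params_ff
assert N == 2 * d_model * n_layer * (2 * d_attn + d_ff)  # closed form from the totals row

# FLOPs per token: 2 FLOPs (one multiply + one add) per parameter,
# plus the attention-mask term, which involves no parameters of its own.
flops_weights = 2 * N
flops_mask = 2 * n_layer * n_ctx * d_attn
C_forward = flops_weights + flops_mask

print(f"N (non-embedding params): {N:,}")
print(f"C_forward per token:      {C_forward:,} FLOPs (= 2N + mask term)")
print(f"mask term share:          {flops_mask / C_forward:.1%}")
```

For this configuration the mask term works out to about 10% of C_forward; more generally its share of the weight FLOPs is roughly n_ctx / (12 * d_model) when d_attn = d_model and d_ff = 4 * d_model, which is why the paper treats C_forward \approx 2N as a good approximation for typical model sizes.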