Hi folks,

I’m trying to compare FLOPs per token across various Transformer architectures and came across the estimation formulas in OpenAI’s scaling laws paper (Kaplan et al., 2020).

In a nutshell, they claim that the **forward pass of a decoder-only Transformer costs \approx 2N FLOPs per token** (roughly one multiply and one add per parameter), where N is the number of non-embedding parameters in the model.

For a given input sequence of length S, this nifty result lets one estimate the inference cost of a decoder-only model as \approx 2N FLOPs per token, i.e. \approx 2NS FLOPs for the full sequence.
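To make the estimate concrete, here is a quick back-of-envelope check of the 2N rule in Python, using GPT-2-small-like dimensions (the specific values are hypothetical and just for illustration):

```python
# Hypothetical GPT-2-small-like dimensions, for illustration only.
n_layer = 12
d_model = 768
d_attn = 768          # attention dimension (= d_model here)
d_ff = 4 * d_model    # feed-forward width
S = 512               # example sequence length

# Non-embedding parameter count, per the "Total (non-embedding)"
# row of the paper's Table 1: N = 2 * d_model * n_layer * (2*d_attn + d_ff)
N = 2 * d_model * n_layer * (2 * d_attn + d_ff)

flops_per_token = 2 * N                   # ~2N FLOPs per token
flops_per_sequence = flops_per_token * S  # ~2NS for the whole sequence

print(f"N = {N:,}")                       # ~85M, matching GPT-2 small
print(f"FLOPs/token = {flops_per_token:,}")
print(f"FLOPs/sequence = {flops_per_sequence:,}")
```

This reproduces the familiar ~85M non-embedding parameter count for GPT-2 small, which is a reassuring sanity check on the formula.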

The per-operation FLOP counts come from Table 1 of their paper:

My question is:

How exactly is the equation for C_\mathrm{forward} derived? Is it the sum of all rows in the table or something else?

In particular, how is d_\mathrm{embd} expressed in terms of the other known variables that make up N? Similarly, is the “De-embed” estimate 2d_\mathrm{model}n_\mathrm{vocab} excluded from the calculation (I know the “Embed” one is)?
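For concreteness, here is my own attempt at summing the non-embedding per-token FLOP rows of Table 1 (QKV, Mask, Project, Feedforward), which does seem to reproduce the paper’s total; I’d like to confirm this is the intended derivation:

$$
\begin{aligned}
C_\mathrm{forward} &\stackrel{?}{=} \underbrace{2\,n_\mathrm{layer}\,d_\mathrm{model}\,3\,d_\mathrm{attn}}_{\text{QKV}} + \underbrace{2\,n_\mathrm{layer}\,n_\mathrm{ctx}\,d_\mathrm{attn}}_{\text{Mask}} + \underbrace{2\,n_\mathrm{layer}\,d_\mathrm{attn}\,d_\mathrm{model}}_{\text{Project}} + \underbrace{2\,n_\mathrm{layer}\,2\,d_\mathrm{model}\,d_\mathrm{ff}}_{\text{Feedforward}} \\
&= 2\,\big[\,2\,d_\mathrm{model}\,n_\mathrm{layer}\,(2\,d_\mathrm{attn} + d_\mathrm{ff})\,\big] + 2\,n_\mathrm{layer}\,n_\mathrm{ctx}\,d_\mathrm{attn} \\
&= 2N + 2\,n_\mathrm{layer}\,n_\mathrm{ctx}\,d_\mathrm{attn} \;\approx\; 2N \quad \text{when } d_\mathrm{model} \gg n_\mathrm{ctx}/12.
\end{aligned}
$$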

Thanks!