Understanding FLOPs-per-token estimates from OpenAI's scaling laws

Sharing the answer internally from Thomas Wang:

How exactly is the equation for C_forward derived? Is it the sum of all rows in the table or something else?

Yes to the latter question: it is the sum of the rows in the table.
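
For reference, here is that sum written out from the per-token FLOPs column of the table (Embed and De-embed rows excluded, as covered below); the variable names follow the paper:

```latex
% Summing the per-token FLOPs rows (QKV, Mask, Project, Feed-Forward):
C_{\text{forward}}
  \approx 2\, n_{\text{layer}} d_{\text{model}} (3 d_{\text{attn}})   % QKV
        + 2\, n_{\text{layer}} n_{\text{ctx}} d_{\text{attn}}          % attention mask
        + 2\, n_{\text{layer}} d_{\text{attn}} d_{\text{model}}        % projection
        + 2\, n_{\text{layer}} (2 d_{\text{model}} d_{\text{ff}})      % feed-forward
  = 2N + 2\, n_{\text{layer}} n_{\text{ctx}} d_{\text{attn}},
  \qquad N = 2\, d_{\text{model}} n_{\text{layer}} (2 d_{\text{attn}} + d_{\text{ff}}).
```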

In particular, how is d_embd converted into one of the other known variables that make up N?

d_embd == d_model
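
So wherever d_embd appears, it can be read as d_model; with that substitution, the non-embedding parameter count from the table reduces to the familiar form (assuming the standard choices d_attn = d_model and d_ff = 4 d_model):

```latex
N \approx 2\, d_{\text{model}} n_{\text{layer}} (2 d_{\text{attn}} + d_{\text{ff}})
  \;\approx\; 12\, n_{\text{layer}} d_{\text{model}}^{2}
  \quad (d_{\text{attn}} = d_{\text{model}},\; d_{\text{ff}} = 4 d_{\text{model}}).
```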

Similarly, is the “De-embed” estimate 2 * d_model * n_vocab excluded from the calculation (I know the Embed one is)?

Yes (sorry, it’s a bit hard to write math here, but essentially for QKV/Project/FF, if the parameter count is P, then the FLOPs per token is 2P). Consequently, if you add everything up, you end up with N parameters and 2N FLOPs per token (and then you add the masking term).
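
A minimal sketch of that bookkeeping in Python (my own illustration, not from the thread): it sums the QKV/Project/FF parameters to get N, applies the 2 FLOPs per parameter rule, and adds the masking term. The hyperparameter names (n_layer, d_model, d_attn, d_ff, n_ctx) follow the paper's table; the d_attn = d_model and d_ff = 4*d_model defaults are the usual Transformer choices, not anything stated above.

```python
def params_and_flops_per_token(n_layer, d_model, n_ctx, d_attn=None, d_ff=None):
    """Non-embedding parameter count N and forward-pass FLOPs per token,
    summing the per-row estimates discussed above (Embed/De-embed excluded)."""
    d_attn = d_attn if d_attn is not None else d_model  # usual choice: d_attn == d_model
    d_ff = d_ff if d_ff is not None else 4 * d_model    # usual choice: d_ff == 4 * d_model

    qkv_params = n_layer * d_model * 3 * d_attn   # attention QKV projections
    proj_params = n_layer * d_attn * d_model      # attention output projection
    ff_params = n_layer * 2 * d_model * d_ff      # two feed-forward matrices
    n_params = qkv_params + proj_params + ff_params  # N (non-embedding)

    # Each weight parameter contributes 2 FLOPs per token (multiply + add),
    # so the weight terms give 2 * N; masking adds 2 * n_layer * n_ctx * d_attn.
    c_forward = 2 * n_params + 2 * n_layer * n_ctx * d_attn
    return n_params, c_forward


# Example: GPT-2-small-like config (hypothetical numbers, for illustration only)
N, C_fwd = params_and_flops_per_token(n_layer=12, d_model=768, n_ctx=1024)
print(f"N ≈ {N:,} params, C_forward ≈ {C_fwd:,} FLOPs per token")
```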