Sharing the answer internally from Thomas Wang:
How exactly is the equation for C_forward derived? Is it the sum of all rows in the table or something else?
The latter: it is something else, not the sum of all rows.
In particular, how is d_embd converted into one of the other known variables that make up N?
d_embd == d_model
Similarly, is the “De-embed” estimate 2 * d_model * n_vocab excluded from the calculation (I know the Embed one is)?
Yes. (Sorry, it’s a bit hard to write math here.) Essentially, for each of the QKV, Project, and FF rows, if the parameter count is P, then the FLOPs per token is 2P. Consequently, if you add those rows up, you end up with N parameters and 2N FLOPs per token, and then you add the masking term.
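
To make the arithmetic concrete, here is a rough Python sketch of that bookkeeping. The per-row parameter formulas and the masking term (2 * n_layer * n_ctx * d_attn) are assumptions about what the table contains, since the table is not reproduced in this thread. The function name and the example dimensions are hypothetical, chosen only for illustration.

```python
def forward_flops_per_token(n_layer, d_model, d_attn, d_ff, n_ctx):
    """Return (N, C_forward) for the non-embedding part of the model.

    Assumed per-row parameter counts (not taken from this thread):
      QKV:     n_layer * d_model * 3 * d_attn
      Project: n_layer * d_attn * d_model
      FF:      n_layer * 2 * d_model * d_ff
    Embed and De-embed are deliberately left out.
    """
    params = {
        "qkv": n_layer * d_model * 3 * d_attn,
        "project": n_layer * d_attn * d_model,
        "ff": n_layer * 2 * d_model * d_ff,
    }
    n = sum(params.values())  # N: total non-embedding parameters

    # Each of these rows costs 2P FLOPs per token (one multiply and one add
    # per parameter), so together they contribute 2N.
    matmul_flops = sum(2 * p for p in params.values())

    # Masking row: no parameters, but it still costs FLOPs (assumed formula).
    mask_flops = 2 * n_layer * n_ctx * d_attn

    return n, matmul_flops + mask_flops


# Hypothetical example dimensions, for illustration only.
n, c_forward = forward_flops_per_token(
    n_layer=12, d_model=768, d_attn=768, d_ff=3072, n_ctx=1024
)
print(f"N = {n}, C_forward = {c_forward}")
```

Under these assumptions, N collapses to 2 * d_model * n_layer * (2 * d_attn + d_ff), and C_forward is 2N plus the masking term, which matches the description above.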