Understanding FLOPs-per-token estimates from OpenAI's scaling laws

Hi folks,

I’m trying to compare FLOPs-per-token for various Transformer architectures and came across the estimate formulas provided in OpenAI’s scaling laws paper.

In a nutshell, they estimate that the forward pass of a decoder-only Transformer costs \approx 2N FLOPs per token (counting each multiply-accumulate as two floating-point operations), where N is the number of non-embedding parameters in the model.

For a given input sequence length S, this nifty result lets one estimate the inference cost of decoder-only models at roughly N \times S FLOPs per generated token.

The estimate for the number of add-multiply operations comes from Table 1 of their paper, which lists the parameter count and FLOPs per token for each component (Embed, Attention: QKV / Mask / Project, Feedforward, De-embed) together with the non-embedding totals.
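For reference, here’s a minimal Python sketch (mine, not from the paper) of the “Total (non-embedding)” row of that table; the variable names follow the paper’s notation, and the GPT-2-small-style shape at the end is purely illustrative:

```python
def non_embedding_params(n_layer, d_model, d_attn, d_ff):
    # Table 1, "Total (non-embedding)": N = 2 * d_model * n_layer * (2 * d_attn + d_ff)
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)

def forward_flops_per_token(n_layer, d_model, d_attn, d_ff, n_ctx):
    # Table 1, "Total (non-embedding)": C_forward = 2 * N + 2 * n_layer * n_ctx * d_attn
    N = non_embedding_params(n_layer, d_model, d_attn, d_ff)
    return 2 * N + 2 * n_layer * n_ctx * d_attn

# GPT-2-small-like shape, just to get a feel for the numbers (~85M non-embedding params)
print(forward_flops_per_token(n_layer=12, d_model=768, d_attn=768, d_ff=3072, n_ctx=1024))
```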

My question is:

How exactly is the equation for C_\mathrm{forward} derived? Is it the sum of all rows in the table or something else?

In particular, how is d_\mathrm{embd} converted into one of the other known variables that make up N? Similarly, is the “De-embed” estimate 2d_\mathrm{model}n_\mathrm{vocab} excluded from the calculation (I know the Embed one is)?

Thanks!

Sharing the answer I received internally from Thomas Wang:

How exactly is the equation for C_forward derived? Is it the sum of all rows in the table or something else?

Yes to the latter question

In particular, how is d_embd converted into one of the other known variables that make up N?

d_embd == d_model

Similarly, is the “De-embed” estimate 2 * d_model * n_vocab excluded from the calculation (I know the Embed one is)?

Yes (sorry, it’s a bit hard to write math here, but essentially for QKV/Project/FF, if a component has P parameters, then its FLOPs per token is 2P). Consequently, if you add everything up, you end up with N parameters and 2N FLOPs per token (and then you add the masking term).
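To make the “add everything up” step concrete, here’s a small sanity check of my own (not part of the reply above) that sums the per-component FLOPs rows from Table 1 and confirms they reduce to 2N plus the masking term:

```python
# Illustrative shape (GPT-2-small-like); any positive integers would do.
n_layer, d_model, d_attn, d_ff, n_ctx = 12, 768, 768, 3072, 1024

qkv     = 2 * n_layer * d_model * 3 * d_attn  # Attention: QKV
mask    = 2 * n_layer * n_ctx * d_attn        # Attention: Mask (context-dependent)
project = 2 * n_layer * d_attn * d_model      # Attention: Project
ff      = 2 * n_layer * 2 * d_model * d_ff    # Feedforward
# Embed and De-embed rows are excluded, as discussed above.

N = 2 * d_model * n_layer * (2 * d_attn + d_ff)  # non-embedding parameters
assert qkv + project + ff == 2 * N               # the "2 FLOPs per parameter" part
assert qkv + project + ff + mask == 2 * N + 2 * n_layer * n_ctx * d_attn  # C_forward
```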

I apologize if this should be obvious, but just to clarify, this computation is for a single output token? So if I were trying to generate a response, for example, from a chat-bot, I would expect to pay this computation cost for every token until a stop token was generated?

Yes, that’s right: these estimates are just for the forward and backward passes, so you’d have to factor in the extra cost of the decoding algorithm (beam search vs. sampling, etc.).


First, thanks a lot for the response; I really appreciate you sharing your insight. This answer seems right to me at first glance, but it leads me to a conclusion that I can’t make sense of, so maybe there is more to the story? If a network that accepts an input window of size S and has N parameters takes O(NS) operations to produce a single output token, then, logically, it would seem to take O(NSM) operations to produce a response of length M.

What confuses me is that OpenAI, as well as others running these “model as a service” paid APIs, charge for tokens and, in every case I see, they charge a price per k tokens, counting both your input and your output tokens. This means that the cost to you is proportional to S + M, while the cost to them is proportional to S \times M. That seems like a pretty badly losing business proposition for them.

Is there some way to effectively reuse the computation done for the prior tokens? What am I missing?

p.s. A little back-of-the-envelope calculation says that if I were to run a GPT-3-sized network on AWS, with a 2k-token input window (which I believe is correct for GPT-3) and a 1k-token output, perhaps in some chat setting, and an (unlikely) beam width of 1, then using this N \times S \times M model of FLOPs it would take something like 2048 \times 175\mathrm{B} \times 1024 \approx 0.00035 ZFLOPs of computation (theoretical, ignoring the efficiency of GPUs). At current prices, a fleet of 8-way A100 servers costs about $12/hr each (and that’s the spot rate!), which, after a little crunching, gives something like $1250/ZFLOP. Putting this together, we get about $0.43/query. In contrast, last I checked, OpenAI’s rate for GPT-3 was something like $0.06 per 1k tokens, or $0.18 for the scenario above. Are they really renting out GPT-3 for less than half the cost of operation? Seems unlikely. Obviously, I could be making some sophomoric mistake here, but… it seems like there is a problem.
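For anyone who wants to follow the arithmetic, here’s roughly the same back-of-the-envelope calculation in Python. The A100 throughput figure (312 TFLOPS dense FP16/BF16 peak) and the N \times S \times M cost model are assumptions taken from the reasoning above, not authoritative numbers:

```python
S, M = 2048, 1024                 # input window and response length, in tokens
N_params = 175e9                  # GPT-3-sized model

flops = N_params * S * M          # the (as it turns out, pessimistic) N*S*M model
zflops = flops / 1e21             # 1 ZFLOP = 1e21 FLOPs  ->  ~0.00037

a100_peak = 312e12                # assumed A100 dense FP16/BF16 peak, FLOPs/sec
server_flops_per_hour = 8 * a100_peak * 3600
usd_per_zflop = 12 / (server_flops_per_hour / 1e21)   # ~ $1300 per ZFLOP

print(zflops * usd_per_zflop)     # ~ $0.5 per query at theoretical peak throughput
```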

Well, this question of mine did not get a response, and now, a bit later, I think I have a pretty good answer, so, in the interest of posterity, I thought I’d post it here for the benefit of others.

First, shout out to Jay Alammar and his great post breaking down the workings of GPT-2. The analysis there generalizes to similar networks.

Basically, I was incorrect in thinking that all of the prior tokens in the window need to be re-processed for every new token. Once a “meaning” is assigned to a token by passing it up through the transformer stack, that meaning is not revisited. The key and value vectors are retained at every layer, however, so the computation for each subsequent token does have to evaluate an increasing number of dot products in the attention blocks. But this represents an insignificant number of operations, even for very large window sizes, compared to the number of operations needed to apply the weights.
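To put a rough number on how small those extra dot products are, here’s a quick comparison for a GPT-3-sized shape (n_layer = 96, d_model = d_attn = 12288, d_ff = 4 * d_model), reusing the Table 1 expressions from earlier in the thread; treat it as a sketch, not an exact accounting:

```python
n_layer, d_model = 96, 12288
d_attn, d_ff = d_model, 4 * d_model
n_cached = 2048                                   # keys/values already in the cache

N = 2 * d_model * n_layer * (2 * d_attn + d_ff)   # ~175B non-embedding parameters
weight_flops_per_token = 2 * N                    # applying the weight matrices
attention_flops_per_token = 2 * n_layer * n_cached * d_attn  # dot products over the cache

print(attention_flops_per_token / weight_flops_per_token)    # ~0.014, i.e. ~1% extra
```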

Thus, for a query sent to a chat-bot-like network, every token in the query is processed once, and a similar amount of work is done for every token in the response. The number of operations is therefore proportional to the number of weights in the network times the sum of the number of input and output tokens, which is consistent with the comment above about model-as-a-service pricing.
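In symbols (my own summary, reusing the paper’s 2N FLOPs-per-token figure and ignoring the small attention-over-cache term): generating a response of M tokens from a prompt of S tokens costs roughly C_\mathrm{total} \approx 2N(S + M) FLOPs, which is exactly the “proportional to input plus output tokens” behaviour that per-token pricing reflects.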

At least, this is my understanding thus far. If anyone sees something needing correcting, please do let me (and the world) know by adding to this thread.