Understanding FLOPs-per-token estimates from OpenAI's scaling laws

Hi folks,

I’m trying to compare FLOPs-per-token for various Transformer architectures and came across the estimation formulas provided in OpenAI’s scaling-laws paper.

In a nutshell, they claim that the forward pass of a decoder-only Transformer involves \approx 2N floating-point operations (one multiply and one add per parameter) per token, where N is the number of non-embedding parameters in the model.

For a given input sequence of length S, this nifty result lets one estimate the cost of a forward pass of a decoder-only model as \approx 2N \times S FLOPs, i.e. \approx 2N FLOPs per token.
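
For concreteness, here is roughly how I am applying that estimate (a toy sketch; the 175B parameter count and 2048-token context are just GPT-3-flavoured example numbers, not anything from the paper):

# Rough cost of one forward pass, assuming ~2N FLOPs per token
N = 175e9   # non-embedding parameters (example value, GPT-3 scale)
S = 2048    # input sequence length (example value)
flops_per_token = 2 * N
flops_per_pass = flops_per_token * S
print(f"{flops_per_token:.2e} FLOPs/token, {flops_per_pass:.2e} FLOPs for {S} tokens")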

The estimate for the number of operations comes from Table 1 of their paper, which lists the parameter count and the FLOPs per token for each operation (embedding, the attention QKV / mask / project steps, the feed-forward layers, and de-embedding), along with the non-embedding totals.

My question is:

How exactly is the equation for C_\mathrm{forward} derived? Is it the sum of all rows in the table or something else?

In particular, how is d_\mathrm{embd} converted into one of the other known variables that make up N? Similarly, is the “De-embed” estimate 2d_\mathrm{model}n_\mathrm{vocab} excluded from the calculation (I know the Embed one is)?

Thanks!

Sharing the answer I received internally from Thomas Wang:

How exactly is the equation for C_\mathrm{forward} derived? Is it the sum of all rows in the table or something else?

Yes to the latter question.

In particular, how is d_embd converted into one of the other known variables that make up N?

d_embd == d_model

Similarly, is the “De-embed” estimate 2 * d_model * n_vocab excluded from the calculation (I know the Embed one is)?

Yes (sorry, it’s a bit hard to write math here, but essentially for QKV/Project/FF, if the parameter count is P, then the FLOPs per token are 2P). Consequently, if you add everything up, you end up with N parameters and 2N FLOPs per token (and then you add the masking term).
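
To spell that out in the paper’s notation (this is how I read Table 1; n_\mathrm{layer}, n_\mathrm{ctx}, d_\mathrm{attn} and d_\mathrm{ff} are the number of layers, the context length, the attention dimension, and the feed-forward dimension):

N \approx 2\, d_\mathrm{model}\, n_\mathrm{layer} \left( 2\, d_\mathrm{attn} + d_\mathrm{ff} \right)

C_\mathrm{forward} \approx 2N + 2\, n_\mathrm{layer}\, n_\mathrm{ctx}\, d_\mathrm{attn}

In other words, every non-embedding weight contributes one multiply and one add per token, and the extra 2 n_\mathrm{layer} n_\mathrm{ctx} d_\mathrm{attn} term is the attention-masking cost, which is small compared to 2N for typical model shapes.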

I apologize if this should be obvious, but just to clarify: this computation is for a single output token? So if I were trying to generate a response from, say, a chat-bot, I would expect to pay this computation cost for every token until a stop token was generated?

Yes, that’s right - these estimates are just for the forward and backward passes, so you’d have to factor in the extra cost of the decoding algorithm (beam search vs. sampling, etc.).

1 Like

First, thanks a lot for the response; I really appreciate you sharing your insight. This answer seems right to me at first glance, but it leads me to a conclusion that I can’t make sense of, so… maybe there is more to the story? If a network that accepts an input window of size S and has N parameters takes O(NS) operations to produce a single output token, then, logically, it would seem to take O(NSM) operations to produce a response of length M.

What confuses me is that companies like OpenAI, as well as others running these “model as a service” paid APIs, charge for tokens and, in every case I see, they charge a price per 1k tokens, counting both your input and output tokens. This means that the cost to you is proportional to S + M, while (by the reasoning above) the cost to them is proportional to S \times M. That seems like a pretty badly losing business proposition for them.

Is there some way to effectively reuse the computation done for the prior tokens? What am I missing?

p.s. A little back-of-the-envelope calculation says that if I were to run a GPT-3-sized network on AWS, for a 2k-token input window (which I believe is correct for GPT-3) and a 1k-token output, perhaps in some chat setting, and for an (unlikely) beam width of 1, then using this N \times S \times M model of FLOPs it would take something like 2048 \times 175\mathrm{B} \times 1024 \approx 0.00035 ZFLOPs of computation (theoretical, ignoring the impact of GPU efficiency). At current prices, a fleet of 8-way A100 servers will cost you about $12/hr each (and that’s the spot rate!), which, after a little crunching, gives something like $1250/ZFLOP. Putting this together, we get roughly $0.43/query. In contrast, last I checked, OpenAI’s rate for GPT-3 tokens was something like $0.06 per 1k tokens, or $0.18 for the scenario above. Are they really renting out GPT-3 for less than half the cost of operation? That seems unlikely. Obviously, I could be making some sophomoric mistake here, but… it seems like there is a problem.
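
For anyone who wants to poke at these numbers, here is the same back-of-the-envelope calculation as a tiny script (the prices and the naive N \times S \times M cost model are just the assumptions from above; the printout lands a few cents above my rounded figure):

# Back-of-the-envelope cost of one chat query under the naive N*S*M model
N = 175e9                 # parameters (GPT-3 scale)
S = 2048                  # input window, tokens
M = 1024                  # generated tokens
ZFLOP = 1e21

query_flops = N * S * M   # naive model: reprocess the whole window for every output token
query_zflops = query_flops / ZFLOP         # ~0.00037 ZFLOPs
dollars_per_zflop = 1250                   # from ~$12/hr spot-priced 8-way A100 servers
print(f"{query_zflops:.5f} ZFLOPs -> ~${query_zflops * dollars_per_zflop:.2f} per query")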

1 Like

Well, this question I posted did not get a response, and now, a bit later, I think I have a pretty good answer, so, in the interest of posterity, I thought I’d post it here for the benefit of others.

First, a shout-out to Jay Alammar and his great post breaking down the workings of GPT-2; the analysis there generalizes to similar networks.

Basically, I was incorrect in thinking that all of the prior tokens in the window need to be re-analyzed for every new token. Once a “meaning” has been assigned to a token by passing it up through the Transformer stack, that meaning is not revisited. The key and value vectors are retained at every layer, however, so the computation for each subsequent token does involve an increasing number of dot products in the attention blocks. But this is an insignificant number of operations, even for very large window sizes, compared to the number of operations spent applying the weights.

As a result, for a query sent to a chat-bot-like network, every token in the query is processed once, and a similar amount of work is done for every token in the response. Consistent with the comment above about model-as-a-service pricing, the number of operations is proportional to the number of weights in the network times the sum of the number of input and output tokens.
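
As a concrete illustration of that reuse, here is a rough sketch of incremental decoding with the cached keys and values (past_key_values) in the transformers library; greedy decoding with GPT-2 small is just an example choice here, not how any particular service actually runs inference:

import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2').eval()

prompt_ids = tokenizer('The scaling laws paper says', return_tensors='pt').input_ids

with torch.no_grad():
    # Process the whole prompt once and keep the per-layer keys/values.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    for _ in range(10):
        # Each new token attends to the cached keys/values; the weights are
        # only applied to a single token, not to the whole window again.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))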

At least, this is my understanding thus far. If anyone sees something needing correcting, please do let me (and the world) know by adding to this thread.

2 Likes

(This comment might be superfluous, but a simple “like” didn’t express it well enough, therealadrian :wink:)

I just wanted to thank you for the pointer. This same question bugged me for a good while, and I couldn’t understand how that was possible. I thought the total runtime of decoder generation would scale as O(n^3), but I was wrong; it seems it’s “only” O(n^2), where n is simply the sum of the prompt and output lengths (although in practice the output is likely slower to compute, since it can’t be parallelized as well on a GPU, and that’s probably why OpenAI charges more for output tokens than for input tokens, but “just” a factor of 2 more).

For reference, this is the same for all similar models, like GPT-2, BLOOM, etc. My early experiments with BLOOM puzzled me because the runtime hardly seemed to depend on (short) prompt length at all; now I get why that is.

The key part of that blog post which explains why this works is that GPT-2-like models only use masked attention, even in training. That is, during training, if the model sees the sentence “Hello world, how are you?”, then to compute the key, value, and query vectors for each token it only looks at the previous ones. This of course makes sense because you want to use the outputs to predict the next token (and apply the loss there), so you can’t cheat and look at the future; but a priori it isn’t a given. You could, for example, use full attention over all the earlier words when predicting the final “?”, if you only applied the loss to that token. In that case, the key, query, and value vectors of each token would depend on the whole sentence, and that would be fine. In practice, though, it would be a very bad idea, because training would be much slower (instead of learning all the next-word predictions at once, you’d only learn one per sentence). I guess that’s how it worked in LSTM times, and it’s why Transformers allowed massively more parallelized training. Plus, indeed, inference would be much slower.
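
For illustration, here is a minimal sketch of that causal masking in plain PyTorch (a single attention head with arbitrary shapes, not GPT-2’s actual implementation):

import torch

T, d = 5, 8                      # sequence length and head dimension (arbitrary)
q, k, v = (torch.randn(T, d) for _ in range(3))

scores = q @ k.T / d ** 0.5      # (T, T) attention scores
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float('-inf'))   # token i cannot see tokens > i
attn_out = scores.softmax(dim=-1) @ v                     # row i depends only on tokens 0..i

print(attn_out.shape)            # torch.Size([5, 8])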

To check this experimentally, one can verify that all the predictions (as well as the intermediate layers) are identical when the model is prompted with two sentences that differ only in the last token:

import numpy as np
import torch
from transformers import AutoTokenizer, GPT2Model

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2').eval()

outputs = []
for text in ['This is an awesome prompt', 'This is an awesome feature']:
    encoded_input = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        outputs.append(model(**encoded_input))

# Compare the final hidden states of the two prompts, position by position
print(np.isclose(outputs[0].last_hidden_state[0].numpy(),
                 outputs[1].last_hidden_state[0].numpy()).all(axis=1))

This returns:

[ True  True  True  True False]

i.e. identical hidden states (and hence identical next-token predictions) everywhere except at the very last token.