Token per second calculations

Hi! I’m trying to calculate the number of tokens per second that I expect to get from the “llama 7b” model deployed on an A10G (31.52 TFLOPS for FP16).

I know that tokens per second = FLOPS / (2 * number of model parameters)

When I do the calculations I found that

no_of_tokens = (31.52 * 10^12) / (2 * 7 * 10^9) ≈ 2251.4 tokens / second
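The same compute-bound estimate in code (a sketch using the numbers above; the variable names are mine):

```python
# Compute-bound estimate: peak FLOPs divided by FLOPs needed per token.
tflops = 31.52e12            # A10G peak FP16 throughput, FLOPs per second
params = 7e9                 # LLaMA-7B parameter count
flops_per_token = 2 * params # ~2 FLOPs per parameter per generated token
tokens_per_s = tflops / flops_per_token
print(round(tokens_per_s, 1))  # ~2251.4
```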

but what I get from the model is approximately 32 tokens / second.

Am I missing something?


Dude, I am at the same spot… Did you find out what you were missing?


It could be due to memory bandwidth limitations: token generation is memory-bound, not compute-bound.

Generating a single token with the 7B model requires reading all 14 GB of FP16 parameters from GPU memory. At 600 GB/s memory bandwidth, one full pass over the weights takes 14 GB ÷ 600 GB/s ≈ 0.023 seconds. This translates to a theoretical maximum of ~43 tokens per second, which is much closer to what you’re observing.
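Here is the memory-bound version of the estimate as a sketch (same numbers as above; variable names are mine):

```python
# Memory-bandwidth-bound estimate: each generated token re-reads all weights.
weight_bytes = 14e9          # 7B params * 2 bytes each (FP16)
bandwidth = 600e9            # A10G memory bandwidth, bytes per second
seconds_per_token = weight_bytes / bandwidth
tokens_per_s = 1 / seconds_per_token
print(round(tokens_per_s, 1))  # ~42.9
```

The real number will be a bit lower still, since the KV cache and activations also consume bandwidth, and utilization is never 100%.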
