Hi! I’m trying to calculate the number of tokens per second I should expect from the Llama 7B model deployed on an A10G (31.52 TFLOPS for FP16).
I know that tokens/second = FLOPs per second / (2 * number of model parameters), since generating one token costs roughly 2 FLOPs per parameter (one multiply and one add per weight).
When I do the calculation I get:

tokens/second = (31.52 * 10^12) / (2 * 7 * 10^9) ≈ 2251.4 tokens/second
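For reference, here is the same arithmetic as a small Python sketch, using the GPU spec and parameter count quoted above (the 2 FLOPs-per-parameter cost per token is the usual rule of thumb for a decoder forward pass):

```python
flops_fp16 = 31.52e12  # A10G FP16 throughput in FLOPs/second (quoted spec)
n_params = 7e9         # Llama 7B parameter count

# Each generated token costs roughly 2 * n_params FLOPs,
# so the compute-bound ceiling is:
tokens_per_second = flops_fp16 / (2 * n_params)
print(f"{tokens_per_second:.1f} tokens/second")  # ~2251.4
```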
but what I actually get from the model is approximately 32 tokens/second.
Am I missing something?