Token per second calculations

Hi! I’m trying to calculate the number of tokens per second that I expect to get from the “llama 7b” model deployed on an A10G (31.52 TFLOPS for FP16).

I know that tokens per second = FLOPS / (2 * number of model parameters)

When I do the calculations I found that

no_of_tokens = (31.52 * 10^12) / (2 * 7 * 10^9) ≈ 2251.4 tokens / second
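The same compute-bound estimate in code (a sketch using the numbers above; the variable names are mine):

```python
# Compute-bound estimate: peak FLOPs divided by FLOPs needed per token.
tflops = 31.52e12            # A10G peak FP16 throughput, FLOPs per second
params = 7e9                 # LLaMA-7B parameter count
flops_per_token = 2 * params # ~2 FLOPs per parameter per generated token
tokens_per_s = tflops / flops_per_token
print(round(tokens_per_s, 1))  # ~2251.4
```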

but what I get from the model is approximately 32 tokens / second.

Am I missing something?


Dude, I am at the same spot… Did you find out what you were missing?


It could be due to memory bandwidth limitations: token generation is memory-bound, not compute-bound.

Generating a single token with the 7B model requires reading all 14 GB of FP16 parameters from GPU memory. At 600 GB/s memory bandwidth, one full pass over the weights takes 14 GB ÷ 600 GB/s ≈ 0.023 seconds. This translates to a theoretical maximum of ~43 tokens per second, which is much closer to what you’re observing.
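Here is the memory-bound version of the estimate as a sketch (same numbers as above; variable names are mine):

```python
# Memory-bandwidth-bound estimate: each generated token re-reads all weights.
weight_bytes = 14e9          # 7B params * 2 bytes each (FP16)
bandwidth = 600e9            # A10G memory bandwidth, bytes per second
seconds_per_token = weight_bytes / bandwidth
tokens_per_s = 1 / seconds_per_token
print(round(tokens_per_s, 1))  # ~42.9
```

The real number will be a bit lower still, since the KV cache and activations also consume bandwidth, and utilization is never 100%.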
