Can someone help me understand the published performance data for the latest LLMs (o1-mini, DeepSeek, Gemini, GPT, …)?
Is there a standard hardware configuration (e.g., 1x H100) used to compare LLM performance?
Are LLM performance benchmarks based on theoretical/calculated speed?
- Usually no hyperparameters, hardware specifications, or input data considerations are published (only "…, we are the fastest").
My experience with a 4090 on Windows and 7B models (no quantization): usually around 1 token per second.
Is the performance much better on Linux (NVIDIA Triton performance optimizations and other Linux-only libraries), and if so, how much better?
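For reference, this is roughly how I measure tokens per second (a minimal sketch, assuming PyTorch + Hugging Face transformers; the model ID is just a placeholder for whichever 7B FP16 model is being tested):

```python
# Minimal sketch for measuring decode throughput (tokens/s) on a single GPU.
# Assumes PyTorch + Hugging Face transformers; MODEL_ID is a placeholder for
# whatever 7B FP16 model is actually being tested.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder, any 7B causal LM

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("Explain memory-bandwidth-bound inference.", return_tensors="pt").to("cuda")

# Warm-up so CUDA kernel compilation / allocations don't distort the timing.
model.generate(**inputs, max_new_tokens=16, do_sample=False)
torch.cuda.synchronize()

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f} s -> {new_tokens / elapsed:.1f} tokens/s")
```

Running the same script on Windows and on Linux with the same GPU and prompt should make the comparison apples to apples.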
As a theoretical example:
An NVIDIA RTX 4090 has a memory bandwidth of 1008 GB/s, so reading 27 GB takes approximately 27 ms. Therefore, we can expect around 27 ms per token for tokens at low positions, where the KV cache has minimal impact. If 8-bit weights are used, reading 13.5 GB takes about 13.4 ms. These estimates represent the theoretical minimum time per token.
So the maximum expected speed would be 1000 ms / 27 ms ≈ 37 tokens per second (ideal situation) for 16-bit weights?
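The same back-of-the-envelope arithmetic as a small sketch (only the numbers from the example above; it ignores KV cache, activations, and compute):

```python
# Bandwidth bound: each generated token streams all weights from VRAM once,
# so time_per_token ~= weight_size / memory_bandwidth.
def bandwidth_bound_tokens_per_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    ms_per_token = weight_gb / bandwidth_gb_s * 1000.0
    return 1000.0 / ms_per_token

print(bandwidth_bound_tokens_per_s(27.0, 1008.0))   # 16-bit weights: ~37 tokens/s
print(bandwidth_bound_tokens_per_s(13.5, 1008.0))   # 8-bit weights:  ~75 tokens/s
```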
Thank you.