LLM performance

I need help understanding published performance data for the latest LLMs (o1-mini, DeepSeek, Gemini, GPT, …).

Is there a standard hardware configuration (e.g., 1x H100) used to compare LLM performance?

Are LLM performance benchmarks based on theoretical/calculated speed?

  • usually no hyperparameters, hardware specifications, or input data considerations are published (only "…, we are the fastest")

My experience with a 4090 on Windows and 7B models (no quantization) is usually around 1 token per second.
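
For reference, this is a minimal sketch of the kind of timing I mean, assuming the Hugging Face transformers + PyTorch stack (the model id is a placeholder, not a specific checkpoint):

```python
# Minimal sketch (assumed setup, not a specific benchmark): timing greedy decoding
# of a 7B model in fp16 on a single GPU with Hugging Face transformers.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-7b-model"  # hypothetical placeholder; substitute an actual checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("Explain memory-bandwidth-bound inference.", return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```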

Is the performance much better on Linux (NVIDIA Triton performance optimizations and other Linux-only libraries), and if so, how much better?

As a theoretical example:
An NVIDIA RTX 4090 has a memory bandwidth of 1008 GB/s, so reading 27 GB takes approximately 27 ms. Therefore, we can expect around 27 ms per token for tokens at low positions, where the KV-cache has minimal impact. If 8-bit weights are used, reading 13.5 GB takes about 13.4 ms. These estimates represent the theoretical minimum time per token.

So the maximum expected speed is 1000 ms / 27 ms ≈ 37 tokens per second (ideal situation) for 16-bit weights?
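
In code, the same back-of-the-envelope estimate looks roughly like this (a sketch only: the 13.5B parameter count is simply what 27 GB of 16-bit weights implies, and the 7B line is added for comparison):

```python
# Back-of-the-envelope sketch: theoretical, memory-bandwidth-bound decode speed,
# assuming every generated token requires reading all weights once and ignoring
# KV-cache reads, compute time, and kernel/launch overhead.

def bandwidth_bound_tokens_per_s(params_billions: float,
                                 bytes_per_weight: float,
                                 mem_bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when decoding is limited purely by weight reads."""
    weight_gb = params_billions * bytes_per_weight      # total weight size in GB
    seconds_per_token = weight_gb / mem_bandwidth_gb_s  # time to stream the weights once
    return 1.0 / seconds_per_token

# RTX 4090: ~1008 GB/s memory bandwidth
print(bandwidth_bound_tokens_per_s(13.5, 2, 1008))  # ~37 tok/s (27 GB of 16-bit weights)
print(bandwidth_bound_tokens_per_s(13.5, 1, 1008))  # ~75 tok/s (13.5 GB of 8-bit weights)
print(bandwidth_bound_tokens_per_s(7.0, 2, 1008))   # ~72 tok/s for a 7B model in 16-bit
```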

Thank You


There is no unified standard yet. However, there are quite a few benchmarks and leaderboards. I hope this is helpful…