Can someone help me understand the published performance data for the latest LLMs (o1-mini, DeepSeek, Gemini, GPT, …)?
Is there a standard hardware configuration (e.g., 1x H100) used to compare LLM performance?
Are LLM performance benchmarks based on theoretical/calculated speed?
- Usually no hyperparameters, hardware specifications, or input data considerations are published (only "…, we are the fastest").
My experience with a 4090 on Windows and 7B models (no quantization): usually around 1 token per second.
Is the performance much better on Linux (NVIDIA Triton performance optimizations and other Linux-only libraries), and if so, how much better?
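For reference, this is roughly how I measure tokens per second (a minimal sketch, assuming PyTorch + Hugging Face transformers; the model ID is just a placeholder for whichever 7B FP16 model is being tested):

```python
# Minimal sketch for measuring decode throughput (tokens/s) on a single GPU.
# Assumes PyTorch + Hugging Face transformers; MODEL_ID is a placeholder for
# whatever 7B FP16 model is actually being tested.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder, any 7B causal LM

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("Explain memory-bandwidth-bound inference.", return_tensors="pt").to("cuda")

# Warm-up so CUDA kernel compilation / allocations don't distort the timing.
model.generate(**inputs, max_new_tokens=16, do_sample=False)
torch.cuda.synchronize()

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f} s -> {new_tokens / elapsed:.1f} tokens/s")
```

Running the same script on Windows and on Linux with the same GPU and prompt should make the comparison apples to apples.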
As a theoretical example:
An NVIDIA RTX 4090 has a memory bandwidth of 1008 GB/s, so reading 27 GB takes approximately 27 ms. Therefore, we can expect around 27 ms per token for tokens at low positions, where the KV cache has minimal impact. If 8-bit weights are used, reading 13.5 GB takes about 13.4 ms. These estimates represent the theoretical minimum time per token.
So the maximum expected speed would be 1000 ms / 27 ms ≈ 37 tokens per second (ideal situation) for 16-bit weights?
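The same back-of-the-envelope arithmetic as a small sketch (only the numbers from the example above; it ignores KV cache, activations, and compute):

```python
# Bandwidth bound: each generated token streams all weights from VRAM once,
# so time_per_token ~= weight_size / memory_bandwidth.
def bandwidth_bound_tokens_per_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    ms_per_token = weight_gb / bandwidth_gb_s * 1000.0
    return 1000.0 / ms_per_token

print(bandwidth_bound_tokens_per_s(27.0, 1008.0))   # 16-bit weights: ~37 tokens/s
print(bandwidth_bound_tokens_per_s(13.5, 1008.0))   # 8-bit weights:  ~75 tokens/s
```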
Thank you.