Discrepancy Between Theoretical and Measured FLOPs/token for Llama 4 Scout 17B (MoE)

Hi everyone,

I’m currently working with the meta-llama/Llama-4-Scout-17B-16E model for text generation and analyzing its performance. To estimate theoretical compute requirements, I’ve been using the following simplified formula for FLOPs/token:

Estimated simplified FLOPs/token = 2N + 2 * L * S * H   (divide by 1e9 for GFLOPs)

Where:

  • N = total number of model parameters (raw count, not billions)
  • L = number of transformer layers
  • S = sequence length (input + output)
  • H = hidden size
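
For concreteness, here is a minimal sketch of how I compute this. The Scout config values (48 layers, hidden size 5120, ~109B total parameters) are my own reading of the model config, so please treat them as assumptions:

```python
def simplified_gflops_per_token(n_params: float, n_layers: int,
                                seq_len: int, hidden: int) -> float:
    """Simplified estimate: 2N + 2*L*S*H, converted to GFLOPs/token."""
    return (2 * n_params + 2 * n_layers * seq_len * hidden) / 1e9

# Assumed Llama 4 Scout text-config values, please correct me if these are off.
estimate = simplified_gflops_per_token(
    n_params=109e9,      # total parameter count (~109B across all 16 experts)
    n_layers=48,         # num_hidden_layers
    seq_len=2048 + 512,  # input_len + output_len
    hidden=5120,         # hidden_size
)
print(f"{estimate:.1f} GFLOPs/token")  # ~219 GFLOPs/token
```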

However, I’m noticing a significant discrepancy between this estimate and the measured average FLOPs/token, calculated as:

Measured FLOPs/token = Total_FLOPs / (batch_size * (input_len + output_len))

Here, Total_FLOPs comes from the PyTorch profiler; my measurement setup is sketched below.
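
For reference, this is roughly my measurement recipe, shown as a minimal sketch with a dummy module standing in for the actual Llama 4 forward pass. One thing I noticed in the docs: with_flops only applies FLOP formulas to specific operators (mainly matrix multiplications and convolutions), so elementwise ops, softmax, normalization, etc. are not counted:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Dummy stand-in for the real Llama 4 forward pass, just to show the recipe.
model = torch.nn.Linear(4096, 4096)
inputs = torch.randn(8, 128, 4096)

with profile(activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on GPU
             with_flops=True) as prof:
    with torch.no_grad():
        model(inputs)

# Sum FLOPs over all profiled ops. Only ops the profiler has FLOP formulas
# for (mainly matmul/conv) contribute, so this can undercount.
total_flops = sum(evt.flops or 0 for evt in prof.key_averages())
print(f"Total_FLOPs = {total_flops:,}")
```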

Even after accounting for Mixture-of-Experts (MoE) sparsity (e.g., only 1 of the 16 experts active per token), the measured FLOPs/token is still far below the simplified theoretical value, especially at standard inference batch sizes and sequence lengths. My back-of-envelope adjustment for the sparsity is below.
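
The adjustment simply swaps total parameters for active parameters in the 2N term. I’m using Meta’s published ~17B-active / ~109B-total figures for Scout and assuming "active" covers the shared weights plus the one routed expert:

```python
# Back-of-envelope MoE adjustment: replace total params with active params.
# Parameter counts are Meta's published figures for Scout; treat as approximate.
N_total = 109e9   # all 16 routed experts plus shared weights
N_active = 17e9   # weights actually used per token (shared + 1 routed expert)

print(f"dense 2N term:  {2 * N_total / 1e9:.0f} GFLOPs/token")   # ~218
print(f"active 2N term: {2 * N_active / 1e9:.0f} GFLOPs/token")  # ~34
```

Even the ~34 GFLOPs/token active-parameter estimate is well above what I measure.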

I’m using Llama 4 purely for text generation, so the vision encoder present in the model’s config shouldn’t be active. I’d like to confirm:

  • Is the simplified formula inherently overestimating for MoE models?
  • Does the sparsity in MoE layers (1/16 experts) reduce compute more drastically than reflected in the formula?
  • Are there internal optimizations (e.g., KV caching, fused ops) that significantly reduce measured FLOPs/token? (My rough KV-cache reasoning is sketched after this list.)
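
On the KV-caching question, my rough (unverified) reasoning is that each generated token attends only to the tokens cached so far, so the attention term grows from 0 to S over a generation and should average out to about half of the naive 2 * L * S * H:

```python
L, H = 48, 5120          # assumed Scout text-config values (as above)
S_in, S_out = 2048, 512  # example input/output lengths
S = S_in + S_out

# Naive formula charges every token the full-context attention term.
naive_attn = 2 * L * S * H

# With a KV cache, token t attends only to the t tokens cached so far,
# so the per-token attention cost averages over the growing context.
cached_attn = sum(2 * L * t * H for t in range(1, S + 1)) / S  # ~= L * S * H

print(f"naive:  {naive_attn / 1e9:.2f} GFLOPs/token")   # ~1.26
print(f"cached: {cached_attn / 1e9:.2f} GFLOPs/token")  # ~0.63
```

If that reasoning is right, the attention term is tiny compared to the 2N term at these lengths, so KV caching alone can’t explain the gap, which is part of why I’m confused.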

Any insights or corrections to the formula (specific to MoE-based decoder-only models like Llama 4) would be greatly appreciated.

Thanks
