Discrepancy Between Theoretical and Measured FLOPs/token for Llama 4 Scout 17B (MoE)

Hi everyone,

I’m currently working with the meta-llama/Llama-4-Scout-17B-16E model for text generation and analyzing its performance. To estimate theoretical compute requirements, I’ve been using the following simplified formula for FLOPs/token:

Estimated simplified FLOPs/token = 2N + 2 * L * S * H   (divide by 1e9 for GFLOPs)

Where:

  • N = total number of model parameters (raw count, not billions)
  • L = number of transformer layers
  • S = sequence length (input + output)
  • H = hidden size
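
For concreteness, here is a minimal sketch of how I compute this. The Scout config values (48 layers, hidden size 5120, ~109B total parameters) are my own reading of the model config, so please treat them as assumptions:

```python
def simplified_gflops_per_token(n_params: float, n_layers: int,
                                seq_len: int, hidden: int) -> float:
    """Simplified estimate: 2N + 2*L*S*H, converted to GFLOPs/token."""
    return (2 * n_params + 2 * n_layers * seq_len * hidden) / 1e9

# Assumed Llama 4 Scout text-config values, please correct me if these are off.
estimate = simplified_gflops_per_token(
    n_params=109e9,      # total parameter count (~109B across all 16 experts)
    n_layers=48,         # num_hidden_layers
    seq_len=2048 + 512,  # input_len + output_len
    hidden=5120,         # hidden_size
)
print(f"{estimate:.1f} GFLOPs/token")  # ~219 GFLOPs/token
```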

However, I’m noticing a significant discrepancy between this estimate and the measured average FLOPs/token, calculated as:

Measured FLOPs/token = Total_FLOPs / (batch_size * (input_len + output_len))

Here, Total_FLOPs comes from the PyTorch profiler; my measurement setup is sketched below.
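
For reference, this is roughly my measurement recipe, shown as a minimal sketch with a dummy module standing in for the actual Llama 4 forward pass. One thing I noticed in the docs: with_flops only applies FLOP formulas to specific operators (mainly matrix multiplications and convolutions), so elementwise ops, softmax, normalization, etc. are not counted:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Dummy stand-in for the real Llama 4 forward pass, just to show the recipe.
model = torch.nn.Linear(4096, 4096)
inputs = torch.randn(8, 128, 4096)

with profile(activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on GPU
             with_flops=True) as prof:
    with torch.no_grad():
        model(inputs)

# Sum FLOPs over all profiled ops. Only ops the profiler has FLOP formulas
# for (mainly matmul/conv) contribute, so this can undercount.
total_flops = sum(evt.flops or 0 for evt in prof.key_averages())
print(f"Total_FLOPs = {total_flops:,}")
```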

Even after accounting for Mixture-of-Experts (MoE) sparsity (e.g., only 1 of the 16 experts active per token), the measured FLOPs/token is still far below the simplified theoretical value, especially at standard inference batch sizes and sequence lengths. My back-of-envelope adjustment for the sparsity is below.
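
The adjustment simply swaps total parameters for active parameters in the 2N term. I’m using Meta’s published ~17B-active / ~109B-total figures for Scout and assuming "active" covers the shared weights plus the one routed expert:

```python
# Back-of-envelope MoE adjustment: replace total params with active params.
# Parameter counts are Meta's published figures for Scout; treat as approximate.
N_total = 109e9   # all 16 routed experts plus shared weights
N_active = 17e9   # weights actually used per token (shared + 1 routed expert)

print(f"dense 2N term:  {2 * N_total / 1e9:.0f} GFLOPs/token")   # ~218
print(f"active 2N term: {2 * N_active / 1e9:.0f} GFLOPs/token")  # ~34
```

Even the ~34 GFLOPs/token active-parameter estimate is well above what I measure.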

I’m using Llama 4 purely for text generation, so the vision encoder present in the model’s config shouldn’t be active. I’d like to confirm:

  • Is the simplified formula inherently overestimating for MoE models?
  • Does the sparsity in MoE layers (1/16 experts) reduce compute more drastically than reflected in the formula?
  • Are there internal optimizations (e.g., KV caching, fused ops) that significantly reduce measured FLOPs/token? (My rough KV-cache reasoning is sketched after this list.)
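
On the KV-caching question, my rough (unverified) reasoning is that each generated token attends only to the tokens cached so far, so the attention term grows from 0 to S over a generation and should average out to about half of the naive 2 * L * S * H:

```python
L, H = 48, 5120          # assumed Scout text-config values (as above)
S_in, S_out = 2048, 512  # example input/output lengths
S = S_in + S_out

# Naive formula charges every token the full-context attention term.
naive_attn = 2 * L * S * H

# With a KV cache, token t attends only to the t tokens cached so far,
# so the per-token attention cost averages over the growing context.
cached_attn = sum(2 * L * t * H for t in range(1, S + 1)) / S  # ~= L * S * H

print(f"naive:  {naive_attn / 1e9:.2f} GFLOPs/token")   # ~1.26
print(f"cached: {cached_attn / 1e9:.2f} GFLOPs/token")  # ~0.63
```

If that reasoning is right, the attention term is tiny compared to the 2N term at these lengths, so KV caching alone can’t explain the gap, which is part of why I’m confused.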

Any insights or corrections to the formula (specific to MoE-based decoder-only models like Llama 4) would be greatly appreciated.

Thanks
