As far as I am aware, `total_flos`, returned by `trainer.train()`, is supposed to indicate the total number of floating-point operations performed while training a given model. However, this value can differ substantially from the commonly used estimate C \approx 6 * N * D, where N is the number of model parameters and D is the number of training tokens.
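For reference, here is a minimal sketch of that estimate (the factor of 6 comes from roughly 2 * N * D for the forward pass plus 4 * N * D for the backward pass):

```python
def estimate_train_flops(num_params: int, num_tokens: int) -> float:
    """Training compute estimate C ≈ 6 * N * D:
    ~2*N*D for the forward pass plus ~4*N*D for the backward pass."""
    return 6.0 * num_params * num_tokens
```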
For instance, as a quick experiment, training a freshly initialized 106M-parameter Llama model on 52M tokens results in `total_flos` = 2.7e16, whereas the formula gives C = 6 * 106e6 * 52e6 ≈ 3.3e16 FLOPs.
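Concretely, this is roughly how I compared the two numbers (model and dataset setup elided; `trainer` is assumed to be a configured `transformers.Trainer` wrapping the 106M-parameter Llama model):

```python
# Sketch of the comparison; assumes `trainer` is already configured and
# that ~52M tokens are seen during training.
train_result = trainer.train()
reported = train_result.metrics["total_flos"]  # also available as trainer.state.total_flos

num_params = sum(p.numel() for p in trainer.model.parameters())
estimated = estimate_train_flops(num_params=num_params, num_tokens=52_000_000)

print(f"reported:  {reported:.2e}")   # ~2.7e16 in my run
print(f"estimated: {estimated:.2e}")  # ~3.3e16
```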
My question is whether the returned `total_flos` is indeed accurate (especially for Llama-type models).
Tagging @teven since I believe he wrote the relevant code.