Total_flos vs C = 6 * N * D

As far as I am aware, total_flos, returned in the metrics of trainer.train(), is supposed to report the total number of floating-point operations performed while training a given model. However, this value can differ substantially from the commonly used estimate C \approx 6 * N * D, where N is the number of model parameters and D is the number of training tokens.

For instance, as a quick experiment, training a freshly initialized 106M-parameter Llama model on 52M tokens results in total_flos=2.7e16, whereas the formula gives C=3.3e16 FLOPs.
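For reference, here is the arithmetic behind the C estimate above, using the parameter and token counts from the experiment:

```python
# Estimate training compute with the standard C ≈ 6 * N * D rule of thumb.
N = 106e6  # model parameters (106M, from the experiment above)
D = 52e6   # training tokens (52M)

C = 6 * N * D
print(f"C ≈ {C:.1e} FLOPs")  # C ≈ 3.3e+16 FLOPs
```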

My question is whether the returned total_flos is indeed accurate (especially for Llama-type models).

Tagging @teven since I believe he wrote the relevant code.

Currently, the Trainer does not include the embedding parameters in the FLOP computation, which is the source of the discrepancy you observed :slight_smile:
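A rough consistency check supports this explanation: if total_flos is computed from non-embedding parameters only, then dividing it by 6 * D should recover the non-embedding parameter count. The sketch below uses the numbers from the experiment above; the implied embedding count is an inference, not a measured value.

```python
# Back-of-the-envelope check: assume total_flos = 6 * N_non_emb * D,
# i.e. the Trainer counts only non-embedding parameters.
total_flos = 2.7e16  # reported by the Trainer in the experiment above
D = 52e6             # training tokens
N_total = 106e6      # total parameter count of the model

# Invert the formula to recover the parameter count the Trainer used.
N_non_emb = total_flos / (6 * D)
N_emb_implied = N_total - N_non_emb

print(f"non-embedding params ≈ {N_non_emb:.1e}")      # ≈ 8.7e+07
print(f"implied embedding params ≈ {N_emb_implied:.1e}")  # ≈ 1.9e+07
```

An implied ~19M embedding parameters is a plausible magnitude for a small Llama-style model (vocabulary size times hidden size), which is consistent with the embeddings being the missing term.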