As far as I am aware, total_flos, returned by trainer.train, is supposed to indicate the total number of floating point operations performed while training a given model. However, this value can differ substantially from the commonly used estimate C \approx 6 * N * D, where N is the number of model parameters and D is the dataset size in tokens.
For instance, as a quick experiment, training a freshly initialized 106M parameter Llama model on 52M tokens results in total_flos=2.7e16, whereas the formula gives C=3.3e16 FLOPs.
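For reference, here is the arithmetic behind that comparison (the N, D, and total_flos values are the ones from my experiment above; substitute your own):

```python
# Sanity check: compare the reported total_flos against the C ≈ 6*N*D estimate.
N = 106e6   # model parameters (106M, from the experiment above)
D = 52e6    # training tokens (52M)
C = 6 * N * D          # common compute estimate in FLOPs

total_flos = 2.7e16    # value reported by trainer.train in the experiment

print(f"C ≈ {C:.2e} FLOPs")                       # ~3.31e16
print(f"total_flos / C ≈ {total_flos / C:.2f}")   # ~0.82
```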
My question is whether the returned total_flos is indeed accurate (especially for Llama-type models).
Tagging @teven since I believe he wrote the relevant code.