As far as I am aware, `total_flos`, returned by `trainer.train()`, is supposed to indicate the total number of floating-point operations performed while training a given model. However, this value can differ substantially from the commonly used estimate C \approx 6 * N * D, where N is the number of model parameters and D is the number of training tokens.
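For reference, here is a minimal sketch of that estimate (the factor of 6 comes from roughly 2 * N * D for the forward pass plus 4 * N * D for the backward pass):

```python
def estimate_train_flops(num_params: int, num_tokens: int) -> float:
    """Training compute estimate C ≈ 6 * N * D:
    ~2*N*D for the forward pass plus ~4*N*D for the backward pass."""
    return 6.0 * num_params * num_tokens
```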
For instance, as a quick experiment, training a freshly initialized 106M-parameter Llama model on 52M tokens results in `total_flos` = 2.7e16, whereas the formula gives C = 6 * 106e6 * 52e6 ≈ 3.3e16 FLOPs.
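Concretely, this is roughly how I compared the two numbers (model and dataset setup elided; `trainer` is assumed to be a configured `transformers.Trainer` wrapping the 106M-parameter Llama model):

```python
# Sketch of the comparison; assumes `trainer` is already configured and
# that ~52M tokens are seen during training.
train_result = trainer.train()
reported = train_result.metrics["total_flos"]  # also available as trainer.state.total_flos

num_params = sum(p.numel() for p in trainer.model.parameters())
estimated = estimate_train_flops(num_params=num_params, num_tokens=52_000_000)

print(f"reported:  {reported:.2e}")   # ~2.7e16 in my run
print(f"estimated: {estimated:.2e}")  # ~3.3e16
```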
My question is whether the returned `total_flos` is indeed accurate (especially for Llama-type models).
Tagging @teven since I believe he wrote the relevant code.