There is an emerging need to know how a given model was pre-trained: fp16, fp32, or bf16, so that one doesn't try to use an fp32-pretrained model in fp16 mode. Most recently we have been flooded with users attempting to run bf16-pretrained (bfloat16!) models under fp16, which is very problematic since the fp16 and bf16 numerical ranges don't overlap well.
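To illustrate why this is a problem, here is a minimal PyTorch sketch (the magnitude is made up purely for demonstration): bf16 shares fp32's exponent range, while fp16 tops out at 65504, so values that are routine for a bf16-trained model overflow to inf in fp16.

```python
import torch

# fp16's largest representable value is 65504; bf16 goes up to ~3.4e38 (like fp32),
# so a magnitude that is fine in bf16 overflows when cast to fp16.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38

x = torch.tensor([1e20], dtype=torch.bfloat16)  # representable in bf16
print(x.to(torch.float16))                      # tensor([inf], dtype=torch.float16)
```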
We are discussing adding a new field to models that will tell users how they were trained.
Some papers don't disclose how a model was trained, so perhaps other ways of finding out can be used.
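One such way (only a heuristic, not proof) is to peek at the dtype the released checkpoint is stored in. A minimal sketch, assuming you have a local copy of the checkpoint file (the filename below is just the usual pytorch_model.bin):

```python
import torch
from collections import Counter

# Path is an example -- point it at a locally downloaded checkpoint.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# The storage dtype is only a hint about the training precision: e.g. fp16
# mixed-precision training keeps fp32 master weights, and those may be what
# was saved.
print(Counter(str(t.dtype) for t in state_dict.values()))
# e.g. Counter({'torch.float16': <number of tensors>}) for an fp16 checkpoint
```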
I have made this into a wiki post, so if you can help compile the knowledge base - you know for sure which precision mode a model belongs to and you can cite a reference - please add an entry below.
Notes:
- Let's focus on official models here - i.e. ones with papers (since some models have hundreds of derived/fine-tuned checkpoints).
- One entry per model is enough, unless you know that different checkpoints were trained with different precision, which is for example the case with the EleutherAI/gpt-neo checkpoints (one in bf16 and another in fp32).
- Typically, if a model was trained on TPU v2 or higher, it's almost certainly bfloat16.
- We are looking for definitive data with references that clearly state how the model was trained. If you are not sure, please don't add anything.
Thank you!
This is a WIKI post, so please add the data directly.
Precision of Pre-trained Models (Wiki)
float16 (mixed precision)
- allenai/longformer - paper: "we employed mixed precision training (floating points 16 and 32) using apex to reduce memory consumption and speed-up training. However, we kept the attention computation in fp32 to avoid numerical instability issues."
- allenai/led - same as allenai/longformer
- lvwerra/codeparrot - informed by the creator of the model
- facebook/m2m100_418M (and others) - train info
- eleutherai/gpt-neox-20b (doesn't exist yet, but included for the sake of future-proofing) - as shown in the configs. The paper also states that the model was trained in fp16; see "Appendix B: Full Configuration Details." Finally, Stella Biderman's official announcement on Twitter includes a link to download both "full" weights and "slim" weights, which implies mixed precision was used.
bfloat16 (mixed precision)
- google/mobilebert - paper: "we train IB-BERT_LARGE on 256 TPU v3 chips"
- eleutherai/gpt-neo-1.3b - shown in the config file
- eleutherai/gpt-j-6b - shown in the GitHub readme
- google/pegasus-cnn_dailymail - XXX: needs reference
- google/pegasus-xsum - XXX: needs reference
- google/mt5 - most likely the same as t5
- t5 - paper: "TPU v3 chips"
- bigscience/T0 and other T0* models - trained on TPUs, confirmed on the BigScience Slack
float32 (full precision)
- EleutherAI/gpt-neo-2.7B - the model's config file doesn't specify precision and the codebase defaults to fp32
- gsarti/it5-base and other it5-* models - stated by the creator (JAX-trained)
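Once you know (or have good reason to believe) which precision bucket a model falls into, you can load it accordingly. A minimal sketch, assuming a transformers version whose from_pretrained accepts the torch_dtype argument (the model id is just one example from the bfloat16 list above):

```python
import torch
from transformers import AutoModelForCausalLM

# gpt-j-6b is listed above under bfloat16, so prefer bf16 (or fp32) over fp16.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    torch_dtype=torch.bfloat16,  # use torch.float32 on hardware without bf16 support
)
```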
Please keep your comments on topic; it should be easy to start a new thread if you have related questions/issues to discuss.