There is an emerging need to know how a given model was pre-trained: in fp16, fp32, or bf16. Knowing this, one won't try to use an fp32-pretrained model in fp16 mode. And most recently we have been bombarded with users attempting to run bf16-pretrained (bfloat16!) models under fp16, which is very problematic since the fp16 and bf16 numerical ranges don't overlap well.
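To make the range mismatch concrete, here is a minimal PyTorch sketch; 70000.0 is just an illustrative value above fp16's ~65504 maximum:

```python
import torch

# fp16 has 5 exponent bits and overflows just past 65504;
# bf16 keeps fp32's 8 exponent bits, trading mantissa precision for range.
x = torch.tensor(70000.0)

print(x.to(torch.float16))   # tensor(inf, dtype=torch.float16) -- out of fp16 range
print(x.to(torch.bfloat16))  # tensor(70144., dtype=torch.bfloat16) -- in range, coarsely rounded
```

This is why a bf16-trained model whose activations or weights exceed fp16's range will produce infs/NaNs when naively run in fp16.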
We are discussing adding a new field to models that will tell users how it was trained.
Some papers don't disclose how a model was trained, so other ways of finding out may be needed.
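One such way, sketched below, is to inspect the dtypes of the tensors in a downloaded checkpoint (`pytorch_model.bin` is a placeholder path for whatever checkpoint you fetched). Caveat: the saved dtype need not match the training dtype, since mixed-precision runs often save fp32 master weights.

```python
from collections import Counter

import torch

def checkpoint_dtypes(path: str) -> Counter:
    """Count the dtypes of the tensors in a saved PyTorch state dict."""
    state_dict = torch.load(path, map_location="cpu")
    return Counter(str(t.dtype) for t in state_dict.values() if torch.is_tensor(t))

# e.g.: checkpoint_dtypes("pytorch_model.bin")
```

If the counter shows mostly `torch.float16` or `torch.bfloat16` tensors, that is at least strong evidence of the regime the weights were exported in.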
I made this into a wiki post, so if you can help compile the knowledge base - and you know for sure which mode a model belongs to and can cite a reference - please add an entry below.
- Let’s focus on official models here - i.e. ones with papers (since some models have hundreds of derived/finetuned checkpoints)
- One entry per model is enough, unless you know that different checkpoints were trained with different precision, which is for example the case with the EleutherAI/gpt-neo checkpoints (one in bf16 and another in fp32)
- Typically, if a model was trained on TPU v2 or higher, it's almost certain it was trained in bf16 (bfloat16 is the TPU's native format)
- We are looking for definitive data with references where it clearly states how the model was trained. If you are not sure then please don’t add anything.
This is a WIKI post, so please add the data directly.
# Precision of Pre-trained Models (Wiki)

## float16 (mixed precision)
- allenai/longformer - paper: “we employed mixed precision training (floating points 16 and 32) using apex to reduce memory consumption and speed-up training. However, we kept the attention computation in fp32 to avoid numerical instability issues.”
- allenai/led - same as longformer
- lvwerra/codeparrot - informed by the creator of the model
- facebook/m2m100_418M (and others) - train info
- eleutherai/gpt-neox-20b (doesn’t exist yet, but included for the sake of future-proofing) - as shown in the configs. The paper also states that the model was trained in fp16; see “Appendix B: Full Configuration Details.” Finally, Stella Biderman’s official announcement on Twitter includes a link to download both “full” weights and “slim” weights, which implies mixed precision was used.
## bfloat16 (mixed precision)
- google/mobilebert - paper: “we train IB-BERT_LARGE on 256 TPU v3 chips”
- eleutherai/gpt-neo-1.3b - shown in the config file
- eleutherai/gpt-j-6b - shown in the GitHub readme
- google/pegasus-cnn_dailymail - XXX: needs reference
- google/pegasus-xsum - XXX: needs reference
- google/mt5 - most likely same as t5
- t5 - paper: “TPU v3 chips”
- T0* models - trained on TPUs, confirmed on the BigScience Slack
## float32 (full precision)
- EleutherAI/gpt-neo-2.7B - the model’s config file doesn’t specify precision and the codebase defaults to fp32
- gsarti/it5-base and other it5-* models - stated by the creator (JAX-trained)
Please keep your comments on topic; it should be easy to start a new thread if you have related questions/issues to discuss.