Model pre-training precision database: fp16, fp32, bf16

There is an emerging need to know how a given model was pre-trained: fp16, fp32, or bf16, so that one doesn't try to use an fp32-pretrained model in the fp16 regime. Most recently we have been bombarded with users attempting to use bf16-pretrained (bfloat16!) models under fp16, which is very problematic since the fp16 and bf16 numerical ranges don't overlap well.
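To illustrate the range mismatch, here is a minimal sketch (assuming PyTorch is installed) of what happens when a value that is perfectly representable in bf16 is cast to fp16: bf16 shares fp32's exponent range (max ≈ 3.4e38), while fp16 tops out at 65504, so large bf16 weights or activations overflow to inf.

```python
import torch

# bf16 shares fp32's ~8-bit exponent, so very large magnitudes are fine
bf16_value = torch.tensor(1e30, dtype=torch.bfloat16)

# fp16 has only a 5-bit exponent with a max of 65504, so the cast overflows
fp16_value = bf16_value.to(torch.float16)

print(bf16_value)                        # ~1e30
print(fp16_value)                        # inf
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38
```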

We are discussing adding a new field to :hugs: models that will tell users how each model was trained.

Some papers don't disclose how a model was trained, so perhaps other ways of finding out can be used.
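One partial workaround, sketched below, assumes a recent transformers version where configs expose a `torch_dtype` field: it records the dtype the checkpoint was serialized in, which is a useful hint but not necessarily the precision used during training. The model id is just an example taken from the list below.

```python
from transformers import AutoConfig

# torch_dtype reflects how the weights were saved, which may or may not
# match the training precision; it can also be None for older checkpoints.
config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6b")
print(config.torch_dtype)
```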

I made this into a wiki post, so if you can help compile the knowledge base - and you know for sure which mode a model belongs to and can cite a reference - please add an entry below.

Notes:

  • Let's focus on official models here - i.e. ones with papers (since some models have hundreds of derived/fine-tuned checkpoints)
  • One entry per model is enough, unless you know that different checkpoints were trained with different precision, as is for example the case with the EleutherAI/gpt-neo checkpoints (one in bf16 and another in fp32)
  • Typically, if a model was trained on TPU v2 or later, it's almost certainly bfloat16.
  • We are looking for definitive data with references that clearly state how the model was trained. If you are not sure, please don't add anything.

Thank you!

This is a WIKI post, so please add the data directly.

Precision of Pre-trained Models (Wiki)

float16 (mixed precision)

  • allenai/longformer - paper, "we employed mixed precision training (floating points 16 and 32) using apex to reduce memory consumption and speed-up training. However, we kept the attention computation in fp32 to avoid numerical instability issues." (see the sketch after this list)
  • allenai/led - same as allenai/longformer
  • lvwerra/codeparrot - informed by the creator of the model
  • facebook/m2m100_418M (and others) train info
  • eleutherai/gpt-neox-20b (doesn't exist yet, but including it for the sake of future-proofing) - as shown in the configs. The paper also states that the model was trained in fp16, see "Appendix B: Full Configuration Details." Finally, Stella Biderman's official announcement on Twitter also includes a link to download both "full" weights and "slim" weights, which implies mixed precision was used.
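As a side note on the Longformer quote above, here is a minimal sketch (not Longformer's actual code, and assuming a CUDA device) of the general pattern it describes: run the model under fp16 autocast, but force a numerically sensitive block - here a stand-in attention-score matmul - back to fp32.

```python
import torch

q = torch.randn(2, 8, 128, 64, device="cuda")
k = torch.randn(2, 8, 128, 64, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    # ... most of the forward pass runs in fp16 here ...

    # disable autocast for the unstable part and compute it in fp32
    with torch.autocast(device_type="cuda", enabled=False):
        scores = torch.matmul(q.float(), k.float().transpose(-2, -1))

print(scores.dtype)  # torch.float32
```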

bfloat16 (mixed precision)

  • google/mobilebert - paper, "we train IB-BERT_LARGE on 256 TPU v3 chips"
  • eleutherai/gpt-neo-1.3b - shown in the config file
  • eleutherai/gpt-j-6b - shown in the GitHub readme
  • google/pegasus-cnn_dailymail - XXX: needs reference
  • google/pegasus-xsum - XXX: needs reference
  • google/mt5 - most likely same as t5
  • t5 - paper “TPU v3 chips”
  • bigscience/T0 and other T0* models (trained on TPUs, confirmed on bigscience slack)

float32 (full precision)

Please keep your comments on topic, it should be easy to start a new thread if you have related questions/issues to discuss.

I’ve updated the listings for EleutherAI models. I decided to include GPT-NeoX 20B in the list for the sake of future-proofing, and so I don’t need to come back and re-document it.

T5 is listed as being trained with mixed precision, but I had a conversation with Colin Raffel today that implies that it was trained in fp32. I will follow up and seek documentation.

This was a miscommunication - T5 was trained in mixed precision.

Thank you for following through, Stella! That’s very helpful!

If possible, please remove the unrelated comments from here and start new threads, as off-topic discussion only invites more unrelated comments. Thank you!