I really love the great documentation from the Huggingface team. Appreciate the effort here!
I have one question regarding the theoretical memory footprint with mixed precision training. Here it says
4 bytes * number of parameters for fp32 training
6 bytes * number of parameters for mixed precision training
My question is: why do we need 2 bytes more per parameter for mixed precision (presumably for the fp16 copy of the weights)? Can’t we just compute the fp16 weights from the fp32 ones on the fly whenever they are needed?
Thanks for the help!
The reason is that when training with mixed precision, some floating-point operations are performed with only 16 bits of precision, while 32 bits are used in critical parts of the network to ensure numeric stability. This way it is possible to speed up training without impairing the learning process.
This means that two copies of the model must be stored, so that its weights are available both in full precision (32b) and half precision (16b). In full precision each model parameter takes 4 bytes (4*8=32 bits), while in half precision only 2 bytes (2*8=16 bits). So the total memory required to store the model is
4B*num_params + 2B*num_params = 6B*num_params
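To make the arithmetic concrete, here is a small back-of-the-envelope helper (the function name and the 1-billion-parameter figure are just illustrative, and this counts only the weight storage itself, not gradients or optimizer state):

```python
def model_memory_bytes(num_params: int, mixed_precision: bool = True) -> int:
    """Bytes needed just to store the model weights (no gradients/optimizer state)."""
    fp32_copy = 4 * num_params  # full-precision master weights: 4 bytes each
    fp16_copy = 2 * num_params if mixed_precision else 0  # extra half-precision copy
    return fp32_copy + fp16_copy

# e.g. a hypothetical 1B-parameter model:
print(model_memory_bytes(1_000_000_000) / 1e9)                         # 6.0 GB
print(model_memory_bytes(1_000_000_000, mixed_precision=False) / 1e9)  # 4.0 GB
```

In practice the real footprint during training is larger, since gradients and optimizer state add several more bytes per parameter on top of this.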