I really love the great documentation from the Huggingface team. Appreciate the effort here!
I have one question regarding the theoretical memory footprint with mixed precision training. Here it says
4 bytes * number of parameters for fp32 training
6 bytes * number of parameters for mixed precision training
My question is: why do we need 2 bytes more per parameter for mixed precision (presumably for the fp16 copy of the weights)? Can’t we just compute the fp16 weights from the fp32 ones on the fly whenever they are needed?
Thanks for the help!
The reason is that when training with mixed precision, some floating-point operations are performed with only 16 bits of precision, while 32 bits are used in critical parts of the network to ensure numeric stability. This way it is possible to speed up training without impairing the learning process.
This means that two copies of the model must be stored, so that its weights are available both in full precision (32b) and half precision (16b). In full precision each model parameter takes 4 bytes (4*8=32 bits), while in half precision only 2 bytes (2*8=16 bits). So the total memory required to store the model is
4B*num_params + 2B*num_params = 6B*num_params
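To make the arithmetic concrete, here is a small back-of-the-envelope helper (the function name and the 1-billion-parameter figure are just illustrative, and this counts only the weight storage itself, not gradients or optimizer state):

```python
def model_memory_bytes(num_params: int, mixed_precision: bool = True) -> int:
    """Bytes needed just to store the model weights (no gradients/optimizer state)."""
    fp32_copy = 4 * num_params  # full-precision master weights: 4 bytes each
    fp16_copy = 2 * num_params if mixed_precision else 0  # extra half-precision copy
    return fp32_copy + fp16_copy

# e.g. a hypothetical 1B-parameter model:
print(model_memory_bytes(1_000_000_000) / 1e9)                         # 6.0 GB
print(model_memory_bytes(1_000_000_000, mixed_precision=False) / 1e9)  # 4.0 GB
```

In practice the real footprint during training is larger, since gradients and optimizer state add several more bytes per parameter on top of this.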