Why does setting `--fp16 True` not save memory as expected?

I want to pre-train RoBERTa on my dataset, but the batch size can be set to 32 at most; anything larger triggers an OOM error. I planned to use mixed precision to save memory, so I set `--fp16 True`. However, the maximum batch size is still 32 before OOM is reported, so enabling FP16 does not seem to save much memory.
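For reference, this is roughly how I enable it (a minimal sketch; the output directory name and exact argument values here are just placeholders, whether you pass them on the command line or build `TrainingArguments` directly):

```python
from transformers import TrainingArguments

# Equivalent of passing --fp16 True --per_device_train_batch_size 32 to a training script
training_args = TrainingArguments(
    output_dir="roberta-pretrain",   # placeholder output directory
    per_device_train_batch_size=32,  # OOM above this, with or without fp16
    fp16=True,                       # enable mixed-precision training
)
```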

Where did you read that it would save memory? Training with mixed precision is faster, but it does not save memory when you train large models: instead of holding one copy of the model in FP32 in GPU RAM, you hold one copy in FP32 and another in FP16, so roughly 1.5 times the weight memory. You do save a bit because the activations are in FP16 instead of FP32, but that is not always enough to let you increase the batch size.
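To make the 1.5× concrete, here is a quick back-of-the-envelope calculation for the weights alone (the ~125M parameter count for roberta-base is an assumption, and optimizer states and activations are ignored):

```python
# Weight memory only: FP32 master copy vs. FP32 master copy + FP16 copy
num_params = 125_000_000  # assumed roberta-base size

fp32_only = num_params * 4        # 4 bytes per FP32 weight  -> ~477 MiB
mixed     = num_params * (4 + 2)  # FP32 copy + FP16 copy    -> ~715 MiB, i.e. 1.5x

print(f"FP32 only: {fp32_only / 2**20:.0f} MiB")
print(f"Mixed precision: {mixed / 2**20:.0f} MiB")
```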


Thanks!

We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used.

This quote is from the paper “Learning Transferable Visual Models From Natural Language Supervision” (CLIP), which states that “Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory”, so the paper does describe mixed precision as saving memory.
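Note that the quote also lists gradient checkpointing, which is what mainly reduces activation memory and allows larger batches. A minimal sketch of combining it with fp16 in the Trainer, assuming a transformers version recent enough to accept the `gradient_checkpointing` argument (otherwise you can call `model.gradient_checkpointing_enable()` before training); the batch size here is just a hypothetical target:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-pretrain",    # placeholder output directory
    per_device_train_batch_size=64,   # hypothetical larger batch size
    fp16=True,                        # faster math, FP16 activations
    gradient_checkpointing=True,      # recompute activations in backward to save memory
)
```

Gradient checkpointing trades compute for memory (activations are recomputed during the backward pass), so training gets somewhat slower but the activation memory drops, which is usually what limits the batch size.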