I want to pre-train RoBERTa on my dataset. However, the batch size can be set to 32 at most; otherwise, an OOM error is reported. I planned to use mixed precision to save memory, so I set --fp16 True. However, the batch size can still only be set to 32 at most; otherwise, OOM is reported. It seems that enabling FP16 is not doing much to save memory.
Where did you read it would save you memory? Training with mixed precision will be faster, but it does not save memory when you train large models: instead of having one copy of the model in FP32 in GPU RAM, you get one copy in FP32 and one copy in FP16 (so 1.5 times the memory for the weights). You save a bit because the activations are in FP16 instead of FP32, but it’s not always enough to allow you to increase the batch size.
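The 1.5x figure for the weights can be checked with a bit of arithmetic. The sketch below is only illustrative: the function name weight_memory_bytes is made up for this example, the parameter count of 125 million is an assumed round number for a RoBERTa-base-sized model, and gradients, optimizer state and activations are deliberately left out.

```python
def weight_memory_bytes(num_params: int, mixed_precision: bool) -> int:
    """Bytes needed for the model weights only.

    Gradients, optimizer state and activations are ignored on purpose.
    """
    fp32_master = 4 * num_params  # FP32 weights: 4 bytes per parameter
    # Mixed precision keeps the FP32 weights and adds an FP16 copy (2 bytes each).
    fp16_copy = 2 * num_params if mixed_precision else 0
    return fp32_master + fp16_copy


# Assumption: roughly the size of RoBERTa-base; replace with your model's count.
num_params = 125_000_000

fp32_bytes = weight_memory_bytes(num_params, mixed_precision=False)
mixed_bytes = weight_memory_bytes(num_params, mixed_precision=True)
ratio = mixed_bytes / fp32_bytes

print(f"FP32 weights:            {fp32_bytes / 2**20:.0f} MiB")
print(f"Mixed-precision weights: {mixed_bytes / 2**20:.0f} MiB")
print(f"Ratio:                   {ratio:.1f}x")
```

Whether the total footprint goes up or down in practice depends on how much the FP16 activations save at your batch size and sequence length, which this sketch does not model.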
Thanks!
We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used.
The paper “Learning Transferable Visual Models From Natural Language Supervision” states that “Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory”.