During training on GPUs, what exactly takes up the GPU memory? From my understanding, these are the main consumers:
- The model weights (and bias terms), which need to be loaded into memory.
- The input tensors (batch_size x input_seq_length).
- The activations saved during the forward pass.
- The gradients computed during the backward pass.
- The optimizer states, assuming Adam (a first- and second-moment estimate for every weight parameter).
Did I miss anything? Please point out if something above is wrong or incomplete; I'd like to understand what is actually consuming GPU memory.
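To make the question concrete, here is a rough back-of-the-envelope sketch of how I'm currently estimating memory from the components listed above, assuming plain fp32 training with Adam. The function name and the activation count are placeholders I made up for illustration:

```python
# Rough, illustrative estimate of training memory for plain fp32 training with Adam.
# The activation count is a placeholder; it depends heavily on batch size,
# sequence length, and the architecture.

def estimate_training_memory_gib(n_params: float,
                                 activation_elems: float,
                                 bytes_per_value: int = 4) -> float:
    weights     = n_params * bytes_per_value          # model weights and biases
    gradients   = n_params * bytes_per_value          # one gradient per weight
    adam_states = 2 * n_params * bytes_per_value      # first + second moment per weight
    activations = activation_elems * bytes_per_value  # activations saved for backward
    total_bytes = weights + gradients + adam_states + activations
    return total_bytes / 1024**3

# Example: a 1B-parameter model with ~5B saved activation values (made-up number)
print(estimate_training_memory_gib(n_params=1e9, activation_elems=5e9))  # ~33.5 GiB
```

Is this roughly the right way to do the accounting, or are there other allocations (temporary buffers, etc.) I should be including?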
In the ZeRO presentation (KDD 2020), they say that if a model has 1 billion parameters, it will take the equivalent of 16 billion parameters' worth of memory when trained with Adam. How does that work out?
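My best guess at where that factor of 16 comes from, assuming they mean mixed-precision (fp16/fp32) training, is that each parameter carries 16 bytes of state. The breakdown below is my own reconstruction, not something taken from the presentation:

```python
# Attempt to reproduce the "16x" figure, assuming mixed-precision training:
# fp16 weights and gradients, plus fp32 optimizer states
# (master weights, Adam first moment, Adam second moment).

n_params = 1e9  # 1 billion parameters

fp16_weights   = 2 * n_params  # 2 bytes per parameter
fp16_gradients = 2 * n_params
fp32_master    = 4 * n_params  # fp32 copy of the weights
fp32_momentum  = 4 * n_params  # Adam first moment
fp32_variance  = 4 * n_params  # Adam second moment

total_bytes = fp16_weights + fp16_gradients + fp32_master + fp32_momentum + fp32_variance
print(total_bytes / n_params)  # -> 16.0 bytes per parameter
print(total_bytes / 1024**3)   # ~14.9 GiB, i.e. roughly 16 GB for 1B parameters
```

Is that the intended interpretation, or is the 16x explained differently?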