During training on GPUs, what exactly takes up the GPU memory? From my understanding, these are the main consumers:
- The model weights (and bias terms), which need to be loaded into memory.
- The input tensors (batch_size x input_seq_length).
- The activations saved during the forward pass.
- The gradients computed during the backward pass.
- The optimizer states, assuming Adam (a first- and second-moment estimate for every weight parameter).
Did I miss anything? Please point out if something above is wrong or incomplete; I'd like to understand what is actually consuming GPU memory.
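To make the question concrete, here is a rough back-of-the-envelope sketch of how I'm currently estimating memory from the components listed above, assuming plain fp32 training with Adam. The function name and the activation count are placeholders I made up for illustration:

```python
# Rough, illustrative estimate of training memory for plain fp32 training with Adam.
# The activation count is a placeholder; it depends heavily on batch size,
# sequence length, and the architecture.

def estimate_training_memory_gib(n_params: float,
                                 activation_elems: float,
                                 bytes_per_value: int = 4) -> float:
    weights     = n_params * bytes_per_value          # model weights and biases
    gradients   = n_params * bytes_per_value          # one gradient per weight
    adam_states = 2 * n_params * bytes_per_value      # first + second moment per weight
    activations = activation_elems * bytes_per_value  # activations saved for backward
    total_bytes = weights + gradients + adam_states + activations
    return total_bytes / 1024**3

# Example: a 1B-parameter model with ~5B saved activation values (made-up number)
print(estimate_training_memory_gib(n_params=1e9, activation_elems=5e9))  # ~33.5 GiB
```

Is this roughly the right way to do the accounting, or are there other allocations (temporary buffers, etc.) I should be including?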
In the ZeRO presentation (KDD 2020), they say that if a model has 1 billion parameters, it will take the equivalent of 16 billion parameters' worth of memory when trained with Adam. How does that work out?
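My best guess at where that factor of 16 comes from, assuming they mean mixed-precision (fp16/fp32) training, is that each parameter carries 16 bytes of state. The breakdown below is my own reconstruction, not something taken from the presentation:

```python
# Attempt to reproduce the "16x" figure, assuming mixed-precision training:
# fp16 weights and gradients, plus fp32 optimizer states
# (master weights, Adam first moment, Adam second moment).

n_params = 1e9  # 1 billion parameters

fp16_weights   = 2 * n_params  # 2 bytes per parameter
fp16_gradients = 2 * n_params
fp32_master    = 4 * n_params  # fp32 copy of the weights
fp32_momentum  = 4 * n_params  # Adam first moment
fp32_variance  = 4 * n_params  # Adam second moment

total_bytes = fp16_weights + fp16_gradients + fp32_master + fp32_momentum + fp32_variance
print(total_bytes / n_params)  # -> 16.0 bytes per parameter
print(total_bytes / 1024**3)   # ~14.9 GiB, i.e. roughly 16 GB for 1B parameters
```

Is that the intended interpretation, or is the 16x explained differently?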