Parameters stored in GPU memory during training

During training on GPUs, what exactly takes up the GPU memory? From my understanding, these are the main consumers:

  1. The model weights (and bias terms), which need to be loaded into memory.
  2. The input tensors, of shape (batch_size x input_seq_length).
  3. The activations saved during the forward pass.
  4. The gradients computed during the backward pass.
  5. The optimizer states, assuming Adam (a first moment and a second moment for every weight parameter).
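To make my mental model concrete, here is a rough back-of-envelope sketch of the parameter-dependent pieces from the list above, assuming plain fp32 training with Adam (4 bytes per value); the function name is just mine, and it deliberately ignores activations and inputs since those depend on batch size and sequence length:

```python
def training_memory_gb(num_params, bytes_per_value=4):
    """Rough fp32 + Adam estimate; excludes activations and input tensors."""
    weights = num_params * bytes_per_value    # 1. model weights (and biases)
    gradients = num_params * bytes_per_value  # 4. one gradient per weight
    adam_m = num_params * bytes_per_value     # 5. Adam first moment
    adam_v = num_params * bytes_per_value     # 5. Adam second moment
    return (weights + gradients + adam_m + adam_v) / 1e9

# Example: a 1-billion-parameter model, before activations:
print(training_memory_gb(1_000_000_000))  # 16.0 (GB)
```

Is this the right way to think about it, with activations coming on top of this fixed cost?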

Did I miss anything? Please shed some light if something above is wrong or incomplete. I would like to understand exactly what consumes GPU memory.

In the ZeRO presentation (KDD 2020), they say that if there are 1 billion parameters, Adam training takes 16 GB of memory, i.e. 16 bytes per parameter. How is that?
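My current guess at the accounting (please correct me if this is wrong) is that ZeRO assumes mixed-precision (fp16/fp32) training with Adam, where the optimizer keeps fp32 master copies alongside the fp16 working values:

```python
# My guess at the ZeRO per-parameter byte accounting under
# mixed-precision Adam (fp16 forward/backward, fp32 optimizer states):
fp16_weight   = 2  # fp16 copy of each weight used in fwd/bwd
fp16_grad     = 2  # fp16 gradient
fp32_weight   = 4  # fp32 master copy kept by the optimizer
fp32_momentum = 4  # Adam first moment
fp32_variance = 4  # Adam second moment

bytes_per_param = (fp16_weight + fp16_grad + fp32_weight
                   + fp32_momentum + fp32_variance)
print(bytes_per_param)  # 16 bytes per parameter -> 16 GB for 1B params
```

Is that where the factor of 16 comes from?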

@sgugger can you help me with this?