I am trying to understand the reasons behind the DeepSpeed autotuner's design choice of measuring activation memory by running an experiment rather than computing it from a formula.