Most of us use the free Colab and Kaggle GPUs, so we need to save as much VRAM as possible to fine-tune a pretrained model. But if someone is using an A100, then yes, I believe the cache should be turned on. Also, upcast the model's norm layers to float32 for better training stability.
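A minimal sketch of the norm-layer upcast in plain PyTorch (the `upcast_norm_layers` helper and the tiny demo model are my own illustration, not from any specific library; with Hugging Face transformers you would additionally toggle `model.config.use_cache` depending on available VRAM):

```python
import torch
import torch.nn as nn

def upcast_norm_layers(model: nn.Module) -> nn.Module:
    # Walk every submodule and cast normalization layers to float32.
    # Matching on the class name catches LayerNorm, RMSNorm, etc.
    for module in model.modules():
        if "norm" in module.__class__.__name__.lower():
            module.to(torch.float32)
    return model

# Tiny demo model fully cast to half precision, as in mixed-precision fine-tuning.
model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8)).half()
upcast_norm_layers(model)

print(model[1].weight.dtype)  # the norm layer is back in float32
print(model[0].weight.dtype)  # the linear layer stays in float16
```

Keeping only the norm layers in float32 costs almost no extra memory (they are tiny relative to the linear weights) while avoiding the reduction-precision issues that make half-precision normalization unstable.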