Why does the Hugging Face Falcon model use `model.config.use_cache = False`? Why wouldn't it want the decoder to re-use computations during fine-tuning?

Most of us use free Colab and Kaggle GPUs, so we need to save as much VRAM as we can to be able to fine-tune a pretrained model. Note that the KV cache mainly speeds up token-by-token generation; during training the full sequence is processed in a single forward pass, so there is little to re-use, and the cache can also conflict with gradient checkpointing. But if someone has an A100, then yes, I believe the cache can be turned on. Also, upcast the model's norm layers to float32 for better training stability.
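Here is a minimal sketch of the norm-upcasting idea in plain PyTorch. The `Block` module is a hypothetical stand-in, not Falcon's actual architecture or module names; on a real Hugging Face model you would also set `model.config.use_cache = False` before training.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a transformer sub-block (not Falcon's real layout)
class Block(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.ln = nn.LayerNorm(dim)    # norm layer we want in float32
        self.proj = nn.Linear(dim, dim)

# Simulate a half-precision fine-tuning setup
model = nn.Sequential(Block(), Block()).half()

# Upcast every LayerNorm back to float32 for numerical stability;
# the rest of the model stays in float16 to save VRAM
for module in model.modules():
    if isinstance(module, nn.LayerNorm):
        module.to(torch.float32)

print(model[0].ln.weight.dtype)    # torch.float32
print(model[0].proj.weight.dtype)  # torch.float16
```

Libraries like PEFT do something similar when preparing a model for low-bit fine-tuning, since small-magnitude LayerNorm statistics are prone to overflow/underflow in float16.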