Why is use_cache incompatible with gradient checkpointing?

I came across a check in modeling_t5.py that forces use_cache off whenever gradient checkpointing is enabled. I wanted to understand why use_cache is incompatible with gradient checkpointing.
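For reference, the guard behaves roughly like the sketch below (a paraphrase, not the exact source; the function name `resolve_use_cache` is mine). As I understand it, the conflict is that gradient checkpointing deliberately discards intermediate activations and recomputes the forward pass during backward, while the KV cache exists to store and reuse exactly that kind of intermediate state, so keeping it would work against the memory savings. The library therefore disables the cache with a warning during checkpointed training:

```python
import logging

logger = logging.getLogger("modeling_sketch")

def resolve_use_cache(use_cache: bool, gradient_checkpointing: bool, training: bool) -> bool:
    """Force the KV cache off when gradient checkpointing is active in training."""
    if gradient_checkpointing and training and use_cache:
        logger.warning(
            "`use_cache=True` is incompatible with gradient checkpointing. "
            "Setting `use_cache=False`..."
        )
        use_cache = False
    return use_cache

# The cache survives untouched outside checkpointed training,
# e.g. at generation time:
resolve_use_cache(True, gradient_checkpointing=False, training=False)  # -> True
```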


Hi there. I run into the same problem in run_clm.py when I set --gradient_checkpointing true. However, I can't find any option in run_clm.py to control use_cache. Does anyone know how to set it?

Does anybody know how to fix this?
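For what it's worth, the warning itself is usually harmless: transformers sets use_cache=False for you and training continues normally. If you want to silence it explicitly, a common workaround is to turn the cache off on the model config before training. A minimal sketch, using a stand-in object so it is self-contained; with a real model the object below would be `model.config`:

```python
from types import SimpleNamespace

def disable_cache_for_checkpointing(config):
    # The KV cache only speeds up autoregressive generation; during
    # checkpointed training it just fights the memory savings, so turn it off.
    config.use_cache = False
    return config

# Stand-in for model.config:
config = SimpleNamespace(use_cache=True)
disable_cache_for_checkpointing(config)
print(config.use_cache)  # -> False
```

In a run_clm.py-style script this would go right after the model is loaded, before the Trainer is constructed.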