What is the purpose of 'use_cache' in the decoder?

Why isn't use_cache compatible with gradient checkpointing? As far as I can tell, use_cache is only for generation, and there are no gradients during generation. @ybelkada maybe, @muellerzr :pray:
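
For context, this is roughly what I understand the cache to be doing, as a toy sketch in plain Python. It is not the transformers implementation: `project_key`, `project_value`, `attend`, and both decode functions are hypothetical names made up for illustration, and the "projections" are arbitrary stand-ins for learned weight matrices.

```python
import math

def project_key(x):
    # Hypothetical stand-in for a learned key projection.
    return [2.0 * v for v in x]

def project_value(x):
    # Hypothetical stand-in for a learned value projection.
    return [v + 1.0 for v in x]

def attend(query, keys, values):
    # Single-head scaled dot-product attention for one query position.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def decode_without_cache(tokens):
    # Recompute keys and values for the whole prefix at every step.
    outputs = []
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project_key(x) for x in prefix]
        values = [project_value(x) for x in prefix]
        outputs.append(attend(tokens[t], keys, values))
    return outputs

def decode_with_cache(tokens):
    # Project each token once and keep the results for later steps.
    cached_keys, cached_values = [], []
    outputs = []
    for x in tokens:
        cached_keys.append(project_key(x))
        cached_values.append(project_value(x))
        outputs.append(attend(x, cached_keys, cached_values))
    return outputs
```

Both functions give the same outputs; the cached version just avoids re-projecting earlier positions at each step, which is why I think of it as a generation-time optimization.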