Why is it that `use_cache` isn't compatible with gradient checkpointing? `use_cache` is just for generation, and there are no gradients during generation. @ybelkada maybe, or @muellerzr?