I want to use past_key_values to compute some losses during training, but gradient checkpointing forbids use_cache in causal LMs. I'm wondering why these two features conflict.
I think it's because gradient checkpointing re-runs the forward pass during the backward pass to recompute activations instead of storing them. If we used past_key_values in that scenario, each recomputed forward call would append duplicate keys/values to the cache, so it would be filled with repeated entries that no longer match the actual sequence.
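For what it's worth, here is a minimal sketch of how this is usually handled in practice with a transformers causal LM (gpt2 is just an example model): the cache is simply turned off while gradient checkpointing is enabled.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Activations are recomputed during backward instead of being stored.
model.gradient_checkpointing_enable()

# Disable the KV cache so the recomputed forward passes don't fill it
# with duplicate keys/values.
model.config.use_cache = False
```

If you need past_key_values for a loss, one option is to run a separate forward pass with use_cache=True on a model (or code path) that doesn't have checkpointing enabled, rather than relying on the cache produced under checkpointing.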