Why is use_cache incompatible with gradient checkpointing?

I came across a check in modeling_t5.py that forces use_cache off whenever gradient checkpointing is enabled. I wanted to understand why use_cache is incompatible with gradient checkpointing.
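For reference, the guard behaves roughly like the sketch below (a paraphrase, not the exact source; the function name `resolve_use_cache` is mine). As I understand it, the conflict is that gradient checkpointing deliberately discards intermediate activations and recomputes the forward pass during backward, while the KV cache exists to store and reuse exactly that kind of intermediate state, so keeping it would work against the memory savings. The library therefore disables the cache with a warning during checkpointed training:

```python
import logging

logger = logging.getLogger("modeling_sketch")

def resolve_use_cache(use_cache: bool, gradient_checkpointing: bool, training: bool) -> bool:
    """Force the KV cache off when gradient checkpointing is active in training."""
    if gradient_checkpointing and training and use_cache:
        logger.warning(
            "`use_cache=True` is incompatible with gradient checkpointing. "
            "Setting `use_cache=False`..."
        )
        use_cache = False
    return use_cache

# The cache survives untouched outside checkpointed training,
# e.g. at generation time:
resolve_use_cache(True, gradient_checkpointing=False, training=False)  # -> True
```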


Hi there. I run into the same problem in run_clm.py when I set --gradient_checkpointing true. However, I can't find any option in run_clm.py to control use_cache. Does anyone know how to set it?

Does anybody know how to fix this?
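For what it's worth, the warning itself is usually harmless: transformers sets use_cache=False for you and training continues normally. If you want to silence it explicitly, a common workaround is to turn the cache off on the model config before training. A minimal sketch, using a stand-in object so it is self-contained; with a real model the object below would be `model.config`:

```python
from types import SimpleNamespace

def disable_cache_for_checkpointing(config):
    # The KV cache only speeds up autoregressive generation; during
    # checkpointed training it just fights the memory savings, so turn it off.
    config.use_cache = False
    return config

# Stand-in for model.config:
config = SimpleNamespace(use_cache=True)
disable_cache_for_checkpointing(config)
print(config.use_cache)  # -> False
```

In a run_clm.py-style script this would go right after the model is loaded, before the Trainer is constructed.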