Gradient Checkpointing with external values

d3nigma · March 11, 2024, 12:05pm

I am looking into Octavius right now, which introduces network-level routing with a mixture of experts. Its router takes the network input and calculates expert scores, which are then passed down to the target modules, e.g. q_proj and v_proj.

My problem is that I don't understand how this can work with gradient checkpointing enabled, because I thought gradient checkpointing wraps the decoder blocks. The scores are not passed down through forward but are set directly as instance variables (q_proj.scores = ...), so I would expect them not to be taken into account during the backward pass.

Is it possible to prevent this somehow?
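
To make the concern concrete, here is a minimal sketch for testing it in isolation. It is not Octavius code: `ScoredLinear`, `run`, and the tiny `nn.Linear` router are hypothetical stand-ins, and the only library call assumed is `torch.utils.checkpoint.checkpoint` with `use_reentrant=False`. The module reads its scores from an attribute set outside `forward`, and the router gradient is compared with and without checkpointing.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ScoredLinear(nn.Module):
    """Hypothetical stand-in for a target module such as q_proj."""

    def __init__(self, dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.scores = None  # set from outside, not passed through forward()

    def forward(self, x):
        # Stack expert outputs: (batch, dim, num_experts)
        outs = torch.stack([expert(x) for expert in self.experts], dim=-1)
        # Weight each expert by its externally supplied score: (batch, num_experts)
        return (outs * self.scores.unsqueeze(1)).sum(dim=-1)


def run(use_checkpoint):
    torch.manual_seed(0)  # identical weights and input in both runs
    router = nn.Linear(8, 2)  # hypothetical router
    proj = ScoredLinear(8, 2)
    x = torch.randn(4, 8)

    # The scores bypass forward() and are attached as an instance variable.
    proj.scores = torch.softmax(router(x), dim=-1)

    if use_checkpoint:
        out = checkpoint(proj, x, use_reentrant=False)
    else:
        out = proj(x)

    out.sum().backward()
    return router.weight.grad.clone()


grad_plain = run(use_checkpoint=False)
grad_ckpt = run(use_checkpoint=True)
print("router gradients match:", torch.allclose(grad_plain, grad_ckpt))
```

This only covers the non-reentrant variant and a single module; the reentrant variant and the full decoder-block setup would need their own check. One thing to watch for in any such setup: the checkpointed forward is re-run during backward and reads `self.scores` again, so the attribute must still hold the same tensor at that point and not be overwritten by a later forward pass.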
