Hi, how does the weight decay rate affect the learning rate?
From the documentation I got that lr_schedule is created like this and then passed to AdamWeightDecay as the learning_rate argument:
```python
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=init_lr,
    decay_steps=num_train_steps - num_warmup_steps,
    end_learning_rate=init_lr * min_lr_ratio,
)
```
It doesn’t use the weight decay rate at all, and if I were to plot this schedule it would just be a straight line going from init_lr to end_lr.
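Just to make that concrete, here is a quick sketch of what I mean by plotting it (the values for init_lr, num_train_steps, num_warmup_steps and min_lr_ratio are dummy placeholders, not my real config):

```python
import tensorflow as tf

# Dummy placeholder values, just to visualise the shape of the schedule.
init_lr = 0.1
num_train_steps = 10
num_warmup_steps = 0
min_lr_ratio = 0.1

lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=init_lr,
    decay_steps=num_train_steps - num_warmup_steps,
    end_learning_rate=init_lr * min_lr_ratio,
)

# PolynomialDecay defaults to power=1.0, so the values form a straight line
# from init_lr down to end_lr; weight_decay_rate never appears here.
values = [float(lr_schedule(step)) for step in range(num_train_steps + 1)]
print([round(v, 4) for v in values])
```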
The weight decay rate is only used later, in the AdamWeightDecay class:
```python
def _decay_weights_op(self, var, learning_rate, apply_state):
    do_decay = self._do_use_weight_decay(var.name)
    if do_decay:
        # Subtracts learning_rate * weight_decay_rate * var from the variable itself.
        return var.assign_sub(
            learning_rate * var * apply_state[(var.device, var.dtype.base_dtype)]["weight_decay_rate"],
            use_locking=self._use_locking,
        )
    return tf.no_op()
```
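If I read that assign_sub line in isolation, it seems to boil down to something like this (a standalone sketch of my reading, with toy values, not the actual optimizer internals):

```python
import tensorflow as tf

# Toy values, just to spell out what the assign_sub line above appears to compute.
var = tf.Variable([1.0, 2.0, 3.0])
learning_rate = 0.1
weight_decay_rate = 0.01

# Each weight is reduced by learning_rate * weight_decay_rate * weight.
var.assign_sub(learning_rate * var * weight_decay_rate)
print(var.numpy())
```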
So is weight_decay_rate just another scalar that scales the learning rate for a particular training step?
For example, if my lr_schedule created by TensorFlow looked like this: [0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01] and weight_decay_rate=0.01, then the final learning rate would look like this: [0.001, 0.0009, 0.0008, 0.0007, 0.0006, 0.0005, 0.0004, 0.0003, 0.0002, 0.0001].
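In code, my interpretation would amount to something like this (purely hypothetical, just to spell out the multiplication I have in mind):

```python
# Purely hypothetical: weight_decay_rate acting as a plain multiplier
# on each value produced by the learning rate schedule.
lr_schedule_values = [0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]
weight_decay_rate = 0.01

effective_lr = [round(lr * weight_decay_rate, 6) for lr in lr_schedule_values]
print(effective_lr)  # [0.001, 0.0009, ..., 0.0001]
```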
Am I getting this right?