Hi, how does the weight decay rate affect the learning rate?
From the documentation I gathered that lr_schedule is created like this and then passed to AdamWeightDecay as the learning_rate argument:
```python
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=init_lr,
    decay_steps=num_train_steps - num_warmup_steps,
    end_learning_rate=init_lr * min_lr_ratio,
)
```
It doesn't use the weight decay rate at all, and if I were to plot this schedule it would just be a straight line going from init_lr to end_lr.
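To make sure I'm reading the schedule right, here is a minimal pure-Python sketch (no TensorFlow; the function name and toy numbers are mine) of what I understand PolynomialDecay to compute with its default power=1:

```python
def polynomial_decay(step, init_lr, end_lr, decay_steps, power=1.0):
    # Clamp the step so the schedule stays at end_lr after decay_steps.
    step = min(step, decay_steps)
    frac = 1.0 - step / decay_steps
    return (init_lr - end_lr) * frac ** power + end_lr

# With power=1 this is a straight line from init_lr down to end_lr.
lrs = [polynomial_decay(s, init_lr=0.1, end_lr=0.01, decay_steps=9) for s in range(10)]
print([round(lr, 2) for lr in lrs])  # [0.1, 0.09, 0.08, ..., 0.02, 0.01]
```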
The weight decay rate is only used later, inside the AdamWeightDecay class:
```python
def _decay_weights_op(self, var, learning_rate, apply_state):
    do_decay = self._do_use_weight_decay(var.name)
    if do_decay:
        return var.assign_sub(
            learning_rate
            * var
            * apply_state[(var.device, var.dtype.base_dtype)]["weight_decay_rate"],
            use_locking=self._use_locking,
        )
    return tf.no_op()
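If I strip away the TF machinery, my reading of that op in plain Python (function name and toy numbers are mine) is just:

```python
def decay_weights(var, learning_rate, weight_decay_rate):
    # Mirror var.assign_sub(learning_rate * var * weight_decay_rate):
    # subtract learning_rate * weight_decay_rate * var from the weight.
    return var - learning_rate * weight_decay_rate * var

w = 2.0
w = decay_weights(w, learning_rate=0.1, weight_decay_rate=0.01)
print(round(w, 6))  # 2.0 - 0.1 * 0.01 * 2.0 = 1.998
```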
So is weight_decay_rate just another scalar that scales the learning rate for a particular training step?
For example, if my lr_schedule created by TensorFlow looked like this:
[0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]
and weight_decay_rate=0.01, would the final learning rate look like this?
[0.001, 0.0009, 0.0008, 0.0007, 0.0006, 0.0005, 0.0004, 0.0003, 0.0002, 0.0001]
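In code, the interpretation I'm asking about would be something like this (toy numbers from my example above):

```python
lr_schedule = [0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]
weight_decay_rate = 0.01

# Is the "effective" learning rate just the schedule scaled by weight_decay_rate?
effective = [round(lr * weight_decay_rate, 6) for lr in lr_schedule]
print(effective)  # [0.001, 0.0009, ..., 0.0001]
```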
Am I getting this right?