Suppose we design an encoder as XLM-R base followed by a linear layer (or some other parameter blocks). During training, should XLM-R and the rest of the model get different learning rates, or the same one? My XLM-R alone converges with a learning rate of 5e-6. Should I use values in the 1e-3 range for the rest of the model (everything except XLM-R), or can those layers also be trained with the very low learning rate (5e-6)?
Besides that, are there any other modifications I need to make to the entire encoder (XLM-R plus the additional randomly initialized layers) during training?
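
To make the question concrete, here is a minimal sketch of the two-learning-rate setup I am asking about. It assumes a PyTorch model whose XLM-R submodule is stored under a hypothetical attribute named `xlmr`, so that its parameter names start with `xlmr.`. The helper `build_param_groups` is my own illustrative name, not a library function. The values 5e-6 and 1e-3 are just the candidates from the question, not a recommendation.

```python
def build_param_groups(named_params, encoder_prefix="xlmr.",
                       encoder_lr=5e-6, head_lr=1e-3):
    """Split (name, parameter) pairs into two optimizer parameter groups.

    Assumption: parameters of the pretrained encoder have names that start
    with `encoder_prefix`; everything else is treated as newly added layers.
    """
    encoder_params, head_params = [], []
    for name, param in named_params:
        if name.startswith(encoder_prefix):
            encoder_params.append(param)
        else:
            head_params.append(param)
    # PyTorch optimizers accept a list of dicts, each with its own "lr".
    return [
        {"params": encoder_params, "lr": encoder_lr},
        {"params": head_params, "lr": head_lr},
    ]

# Intended usage (assumes `model` is a torch.nn.Module with an `xlmr` attribute):
# optimizer = torch.optim.AdamW(build_param_groups(model.named_parameters()))
```

Passing the same value for `encoder_lr` and `head_lr` gives the single-learning-rate alternative, so both options in the question can be compared with the same code.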