Do the common tricks in transformers help with RNNs?

Does anybody know of any research or work that applies the tricks commonly used with transformers (layer norm, masked language modeling, etc.) to RNNs?

Do these techniques still improve RNNs? If you aren't aware of any such work, are there reasons you'd expect these techniques to translate to RNNs, or not?
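
For concreteness, here's a rough sketch of what I have in mind for "layer norm in an RNN": normalizing the recurrent pre-activation at every time step. This is just an illustrative PyTorch example, not from any specific paper; the class and parameter names are my own.

```python
import torch
import torch.nn as nn

class LayerNormRNNCell(nn.Module):
    """Vanilla RNN cell with layer norm applied to the pre-activation (illustrative sketch)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.ih = nn.Linear(input_size, hidden_size, bias=False)   # input-to-hidden
        self.hh = nn.Linear(hidden_size, hidden_size, bias=False)  # hidden-to-hidden
        self.norm = nn.LayerNorm(hidden_size)                      # normalize before the nonlinearity

    def forward(self, x, h):
        # Normalize the summed pre-activation at each step, then apply tanh.
        return torch.tanh(self.norm(self.ih(x) + self.hh(h)))

cell = LayerNormRNNCell(16, 32)
h = torch.zeros(1, 32)
for x in torch.randn(5, 1, 16):  # unroll over 5 time steps
    h = cell(x, h)
print(h.shape)  # torch.Size([1, 32])
```

Is something along these lines (and the analogous idea of masked-token pretraining with a bidirectional RNN) actually used or known to help?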