How to add a GRU layer to a DistilBERT model?

Hey @jarif! I don’t really know how to help with that, but just out of curiosity, why would you add those layers to the model? Doesn’t the self-attention mechanism serve the same purpose? (Measuring and weighting how relevant past and future information is to the current embedding.)

Maybe you could ask the question on distilbert-base-uncased · Discussions, as this will notify the authors too?

My supervisor told me to do that.
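
For anyone landing here with the same question, one way this is often done is to wrap the DistilBERT encoder in a small PyTorch module and run a GRU over its token-level hidden states. The snippet below is only a minimal sketch under some assumptions: the class name `DistilBertGRUClassifier`, the bidirectional GRU, the GRU size, and the classification head are illustrative choices, not something from this thread. It builds a tiny randomly initialized DistilBERT so it runs without downloading weights; in practice you would probably pass `DistilBertModel.from_pretrained("distilbert-base-uncased")` as the encoder instead.

```python
import torch
import torch.nn as nn
from transformers import DistilBertConfig, DistilBertModel


class DistilBertGRUClassifier(nn.Module):
    """Hypothetical wrapper: DistilBERT encoder -> bidirectional GRU -> linear head."""

    def __init__(self, encoder, gru_hidden=128, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.gru = nn.GRU(
            input_size=encoder.config.dim,  # DistilBERT hidden size (768 for the base model)
            hidden_size=gru_hidden,
            batch_first=True,               # encoder output is (batch, seq_len, dim)
            bidirectional=True,
        )
        self.classifier = nn.Linear(2 * gru_hidden, num_labels)

    def forward(self, input_ids, attention_mask=None):
        # Token-level hidden states from DistilBERT: (batch, seq_len, dim)
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # h_n holds the final state of each direction: (2, batch, gru_hidden).
        # Note: padding tokens are not masked out here; for padded batches you
        # may want torch.nn.utils.rnn.pack_padded_sequence before the GRU.
        _, h_n = self.gru(hidden)
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * gru_hidden)
        return self.classifier(pooled)


# Tiny random config so the example runs offline (assumed values, not the real model size).
config = DistilBertConfig(
    vocab_size=100, dim=32, n_layers=1, n_heads=2,
    hidden_dim=64, max_position_embeddings=32,
)
model = DistilBertGRUClassifier(DistilBertModel(config), gru_hidden=16, num_labels=2)

input_ids = torch.randint(0, 100, (3, 10))       # batch of 3 sequences, 10 tokens each
attention_mask = torch.ones_like(input_ids)
logits = model(input_ids, attention_mask)
print(logits.shape)  # torch.Size([3, 2])
```

The GRU here sits on top of the last encoder layer rather than inside the transformer blocks, which keeps the pretrained weights usable as-is. Whether it helps over plain self-attention is an open question for your task, so it is worth comparing against a baseline without the GRU.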