While using the DistilBERT model from Hugging Face, I noticed that there is a dropout layer after the classification layer, before the softmax is applied. Why are we dropping out information at that point? It seems like a bad idea to me, but I want to understand it better, because Hugging Face set this to 0.2 as the default parameter. Is there a good reason behind this?
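For reference, here is a minimal sketch (assuming the `transformers` library and its `DistilBertForSequenceClassification` class) of how I am inspecting the classification head and the config field that seems to control this dropout:

```python
# Minimal sketch, assuming the transformers library is installed.
# Inspects the config field that appears to control this dropout and
# the layers that make up the sequence-classification head.
from transformers import DistilBertConfig, DistilBertForSequenceClassification

config = DistilBertConfig()
print(config.seq_classif_dropout)  # 0.2 by default

# Randomly initialized model, only used here to look at the head structure
model = DistilBertForSequenceClassification(config)
print(model.pre_classifier)  # Linear(in_features=768, out_features=768)
print(model.dropout)         # Dropout(p=0.2)
print(model.classifier)      # Linear(in_features=768, out_features=2)
```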