I tried to fine-tune bert-large-uncased on the GLUE benchmark, loading the checkpoint from the Hub. When I looked at the loading details:
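To be concrete, this is roughly what I ran (a minimal sketch; `output_loading_info=True` exposes the missing/unexpected keys that the loading warning refers to):

```python
from transformers import BertForSequenceClassification

# Rough sketch: load the pre-trained checkpoint into a classification model
# and inspect which keys were missing from / unexpected in the checkpoint.
model, loading_info = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2, output_loading_info=True
)

print(loading_info["missing_keys"])     # e.g. ['classifier.weight', 'classifier.bias']
print(loading_info["unexpected_keys"])  # e.g. the pre-training head keys (cls.*)
```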
My understanding is that the fine-tuning model for GLUE is BERT + a classifier, and the checkpoint only covers BERT, so the missing keys should belong to the classifier. But I think there is also a pooler on top of the BERT encoder, so the pooler keys should be missing too, right? (Why didn't I see them among the missing keys?) And I would have thought the unexpected keys could actually be assigned to the pooler.
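One way to check this is to look directly at what is stored in the checkpoint. A sketch, assuming the repo still ships a `pytorch_model.bin` file:

```python
import torch
from huggingface_hub import hf_hub_download

# Inspect the raw checkpoint to see whether pooler weights are stored in it
path = hf_hub_download("bert-large-uncased", "pytorch_model.bin")
state_dict = torch.load(path, map_location="cpu")

print([k for k in state_dict if "pooler" in k])         # pooler weights, if present
print([k for k in state_dict if k.startswith("cls.")])  # pre-training head weights
```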
I also noticed that during fine-tuning, not loading the pooler's weights (keeping them randomly initialized) works better.
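In case it helps, this is roughly how I re-initialize the pooler after loading, instead of using its pre-trained weights (a sketch; BERT's default init is a normal distribution with std `initializer_range`):

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2
)

# Overwrite the loaded pooler weights with a fresh random init,
# matching BERT's own weight-initialization scheme.
pooler = model.bert.pooler.dense
pooler.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
pooler.bias.data.zero_()
```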