Hello.
I was surprised that I only need to add a few lines of code to solve various downstream tasks with the help of BERT. For example, below is the head that gets added for the masked language modeling (MLM) task, as printed by PyTorch:
(cls): BertOnlyMLMHead(
  (predictions): BertLMPredictionHead(
    (transform): BertPredictionHeadTransform(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    )
    (decoder): Linear(in_features=768, out_features=30522, bias=True)
  )
)
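By "a few lines" I mean roughly this kind of usage (a minimal sketch, assuming bert-base-uncased, which matches the 768/30522 sizes above; the sentence and the expected prediction are just illustrations):

from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, 30522): one score per vocabulary token

# Locate the masked position and decode the highest-scoring token
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode(logits[0, mask_pos].argmax()))  # should print something like "paris"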
or for the next sentence prediction (NSP) task:
(pooler): BertPooler(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (activation): Tanh()
)
(cls): BertOnlyNSPHead(
  (seq_relationship): Linear(in_features=768, out_features=2, bias=True)
)
On the other hand, it is not clear to me why this works. Why, for example, does the MLM task use last_hidden_state from the BERT model, while BertForNextSentencePrediction uses the pooler output? And if I want to design my own downstream task, what should I do first? I have found a lot of materials explaining how BERT works, but nothing explaining how to build a downstream task.
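To make the question concrete, the two outputs I am comparing both come from the base BertModel (a minimal sketch, assuming bert-base-uncased; the input sentence is just a placeholder):

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("An example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# One 768-dim vector per input token; as I understand it, this is what the MLM head reads
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
# The [CLS] vector passed through the pooler (Linear + Tanh); what the NSP head reads
print(outputs.pooler_output.shape)      # (1, 768)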