How downstream tasks work

Hello.
I was surprised that I only need to add a few lines of code to solve various tasks with the help of BERT. For example, below is the downstream task code for the MLM one:

  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=30522, bias=True)
    )
)

or for the next sentence prediction one:

    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (cls): BertOnlyNSPHead(
    (seq_relationship): Linear(in_features=768, out_features=2, bias=True)
  )

But on the other hand, it is not clear to me why this works… Why, for example, does solving the MLM task use the last_hidden_state from the BERT model, while BertForNextSentencePrediction uses the pooler output? Or if I want to define my own downstream task, what should I do first? I found a lot of materials explaining how BERT works, but nothing explaining how to build a downstream task.

During the pretraining procedure of BERT, there are two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Masked Language Modeling requires the model to make a prediction for every token in the sequence, including the [MASK] tokens, and is handled by a head that produces outputs over the entire vocabulary, hence the size 30522. The Next Sentence Prediction task uses the pooler output to do a binary classification, namely whether the second sequence is a suitable continuation of the first sequence.
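To make this concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the example sentence is arbitrary) that prints the two outputs the heads consume:

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state: one 768-dim vector per token; the MLM head maps each
    # of these vectors to vocabulary logits (768 -> 30522).
    print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])

    # pooler_output: a single 768-dim vector per sequence (the [CLS] hidden
    # state passed through a dense + tanh layer); the NSP head maps it to
    # 2 logits (768 -> 2).
    print(outputs.pooler_output.shape)      # torch.Size([1, 768])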

Therefore, BertForPreTraining contains a BertPreTrainingHeads layer that includes both a language modeling head and a next sentence prediction head. However, for downstream token-level tasks using BertForMaskedLM or BertForTokenClassification, the pooler layer is discarded since there is no need for sequence-level classification. Conversely, for downstream sequence classification tasks using BertForNextSentencePrediction or BertForSequenceClassification, the pooler layer is retained.
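So the rough recipe for your own downstream task is: choose the output that matches the granularity of your task (last_hidden_state for per-token predictions, pooler_output or the [CLS] hidden state for per-sequence predictions) and train a small head on top of it. Below is a hedged sketch, again assuming the Hugging Face transformers library; the class name BertForMyTask and the number of labels are made up for illustration:

    import torch.nn as nn
    from transformers import BertModel

    class BertForMyTask(nn.Module):
        """Hypothetical sequence-level task: classify each input into num_labels classes."""

        def __init__(self, num_labels=3):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            self.dropout = nn.Dropout(0.1)
            # Sequence-level task: a small trainable layer on top of the pooled [CLS] vector.
            self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask=None, token_type_ids=None):
            outputs = self.bert(
                input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
            )
            # For a token-level task, use outputs.last_hidden_state (one vector per token) instead.
            pooled = self.dropout(outputs.pooler_output)
            return self.classifier(pooled)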
