How downstream tasks work

Hello.
I was surprised that I only need to add a few lines of code to solve various tasks with the help of BERT. For example, below is the downstream task code for the MLM one:

  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=30522, bias=True)
    )
)

or for the next sentence prediction one:

    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (cls): BertOnlyNSPHead(
    (seq_relationship): Linear(in_features=768, out_features=2, bias=True)
  )

But on the other hand, it is not clear to me why this works… Why, for example, does solving the MLM task use the last_hidden_state from the BERT model, while BertForNextSentencePrediction uses the pooler output? Or if I want to define my own downstream task, what should I do first? I found a lot of materials explaining how BERT works, but nothing explaining how to build a downstream task.

During the pretraining procedure of BERT, there are two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Masked Language Modeling requires the model to make a prediction for every token in the sequence, including the [MASK] tokens, and is handled by a head that produces outputs over the entire vocabulary, hence the size 30522. The Next Sentence Prediction task uses the pooler output to do a binary classification, namely whether the second sequence is a suitable continuation of the first sequence.
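To make this concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the example sentence is arbitrary) that prints the two outputs the heads consume:

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state: one 768-dim vector per token; the MLM head maps each
    # of these vectors to vocabulary logits (768 -> 30522).
    print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])

    # pooler_output: a single 768-dim vector per sequence (the [CLS] hidden
    # state passed through a dense + tanh layer); the NSP head maps it to
    # 2 logits (768 -> 2).
    print(outputs.pooler_output.shape)      # torch.Size([1, 768])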

Therefore, BertForPreTraining contains a BertPreTrainingHeads layer that includes both a language modeling head and a next sentence prediction head. However, for downstream token-level tasks using BertForMaskedLM or BertForTokenClassification, the pooler layer is discarded since there is no need for sequence-level classification. Conversely, for downstream sequence classification tasks using BertForNextSentencePrediction or BertForSequenceClassification, the pooler layer is retained.
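So the rough recipe for your own downstream task is: choose the output that matches the granularity of your task (last_hidden_state for per-token predictions, pooler_output or the [CLS] hidden state for per-sequence predictions) and train a small head on top of it. Below is a hedged sketch, again assuming the Hugging Face transformers library; the class name BertForMyTask and the number of labels are made up for illustration:

    import torch.nn as nn
    from transformers import BertModel

    class BertForMyTask(nn.Module):
        """Hypothetical sequence-level task: classify each input into num_labels classes."""

        def __init__(self, num_labels=3):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            self.dropout = nn.Dropout(0.1)
            # Sequence-level task: a small trainable layer on top of the pooled [CLS] vector.
            self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask=None, token_type_ids=None):
            outputs = self.bert(
                input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
            )
            # For a token-level task, use outputs.last_hidden_state (one vector per token) instead.
            pooled = self.dropout(outputs.pooler_output)
            return self.classifier(pooled)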
