What is the classification head doing exactly?

olaffson · September 20, 2021, 2:05pm

Hello,

When using a transformer model for text classification, one usually loads a model and then uses AutoModelForSequenceClassification to train the classifier over the N classes in the data.

My question: which model is actually used for classification? Is it a logistic model (which uses the CLS representation as input)?
In the case of several classes (say bad, neutral, good), the usual methodology in machine learning is to train several one-vs-all classifiers and then predict the label with the most votes. Is this what is happening under the hood with huggingface?

Thanks!

olaffson · September 20, 2021, 2:16pm

@nielsr I would be curious to have your take on this, if you have a few moments. Your comments have been incredibly useful so far. Thanks!

BramVanroy · September 20, 2021, 2:21pm

The exact implementation of XXXForSequenceClassification differs from model to model. But you can have a look at e.g. the BERT implementation:

https://github.com/huggingface/transformers/blob/04976a32dc555667afa994e8f918cbee88d84a4f/src/transformers/models/bert/modeling_bert.py#L1481

    @add_start_docstrings(
        """
        Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled
        output) e.g. for GLUE tasks.
        """,
        BERT_START_DOCSTRING,
    )
    class BertForSequenceClassification(BertPreTrainedModel):
        def __init__(self, config):
            super().__init__(config)
            self.num_labels = config.num_labels
            self.config = config

            self.bert = BertModel(config)
            classifier_dropout = (
                config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
            )
            self.dropout = nn.Dropout(classifier_dropout)

In this case it is simply a dropout and then a linear layer on top of the pooled outputs.
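To make the "dropout and then a linear layer" description concrete, here is a minimal sketch in plain PyTorch. The sizes (768 hidden units, 3 labels, batch of 4) and the random tensor standing in for the pooled output are assumptions for illustration only, not values taken from the BERT code above.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes (assumptions): BERT-base-like hidden size, three classes.
hidden_size, num_labels, batch_size = 768, 3, 4

# Stand-in for the pooled output of the base model: one vector per input sequence.
pooled_output = torch.randn(batch_size, hidden_size)

dropout = nn.Dropout(p=0.1)                      # randomly zeroes features in training mode
classifier = nn.Linear(hidden_size, num_labels)  # the linear classification layer

logits = classifier(dropout(pooled_output))
print(logits.shape)  # torch.Size([4, 3])
```

Each row of `logits` holds one raw value per class for the corresponding input sequence.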

olaffson · September 20, 2021, 2:27pm

Oh, thanks @BramVanroy! I am not a hardcore PyTorch user, but I am looking at the code. You say that this is a linear layer on the pooled output. But can the model then predict more than two classes?

BramVanroy · September 20, 2021, 2:50pm

What do you mean? Why wouldn’t it? A linear layer is simply a projection of X dimensions to Y dimensions, e.g. 512 to 512, or, 768 to 1, or any other that you can think of. As you can see here:

https://github.com/huggingface/transformers/blob/04976a32dc555667afa994e8f918cbee88d84a4f/src/transformers/models/bert/modeling_bert.py#L1492

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config

        self.bert = BertModel(config)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        self.init_weights()

    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint=_CHECKPOINT_FOR_DOC,
        output_type=SequenceClassifierOutput,
        config_class=_CONFIG_FOR_DOC,
    )

The linear layer outputs one value per class that you request. If you have five classes, it will output five values. How you deal with those values (single-label or multi-label classification) then depends on your loss function. See:

https://github.com/huggingface/transformers/blob/04976a32dc555667afa994e8f918cbee88d84a4f/src/transformers/models/bert/modeling_bert.py#L1551-L1562

    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            loss = loss_fct(logits.squeeze(), labels.squeeze())
        else:
            loss = loss_fct(logits, labels)
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        loss_fct = BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)
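As a rough illustration of how the same five output values are treated under the two classification branches, the sketch below feeds one batch of logits to both loss functions. The label tensors and the 0.5 threshold are made-up assumptions; only the loss classes themselves come from the snippet above.

```python
import torch
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss

torch.manual_seed(0)
logits = torch.randn(2, 5)  # batch of 2 examples, five output values each

# Single-label: one integer class index per example.
single_labels = torch.tensor([3, 0])
ce_loss = CrossEntropyLoss()(logits, single_labels)

# Multi-label: a float multi-hot vector per example, same shape as the logits.
multi_labels = torch.tensor([[1.0, 0.0, 0.0, 1.0, 0.0],
                             [0.0, 1.0, 1.0, 0.0, 0.0]])
bce_loss = BCEWithLogitsLoss()(logits, multi_labels)

# Turning logits into predictions differs accordingly.
single_pred = logits.argmax(dim=-1)        # exactly one class per example
multi_pred = torch.sigmoid(logits) > 0.5   # any number of classes per example
```

The head is identical in both cases; only the label format, the loss, and the way predictions are read off the logits change.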

nielsr · September 20, 2021, 3:46pm

One typically just places a linear layer (nn.Linear) on top of the final hidden state of the [CLS] token. In other words, a linear classifier is sufficient. It converts the final hidden state vector into a vector that represents the classes.

For HuggingFace models, one typically only learns a single classifier on top of the base model, for binary, multi-class and multi-label classification.

olaffson · September 20, 2021, 8:09pm

Very interesting, @nielsr @BramVanroy. Thanks. It's great to know that huggingface has a consistent linear classification head in all cases.

If my understanding is correct, then the linear classification head actually learns a big matrix whose dimensions are rows * columns = vocabulary size * classes.

Is that right? Thanks!

BramVanroy · September 21, 2021, 7:28am

No, that is not right. As I said before, its input dimensions are usually the output dimensions of the base model (e.g. 768) and the output dimensions are the number of classes, e.g. 3 for “neutral, positive, negative”. The classification weights are, relatively speaking, quite small in many downstream tasks.

During language modeling, the LM head has the same input dimensions, but its output dimension equals the vocabulary size: for each token in the vocabulary it provides a probability of how well that token fits in a given position. This does lead to a large classifier.
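A quick back-of-the-envelope count shows the size difference. This is only a sketch: it counts a single plain linear projection (weights plus bias) and assumes a vocabulary of 30522 tokens, which is the size used by bert-base-uncased; real LM heads may include extra layers or tie weights with the embeddings.

```python
hidden_size = 768
num_classes = 3
vocab_size = 30522  # assumption: vocabulary size of bert-base-uncased


def linear_params(in_features: int, out_features: int) -> int:
    """Number of weights plus biases in one fully connected layer."""
    return in_features * out_features + out_features


classifier_params = linear_params(hidden_size, num_classes)
lm_projection_params = linear_params(hidden_size, vocab_size)

print(classifier_params)     # 2307
print(lm_projection_params)  # 23471418
```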

olaffson · September 21, 2021, 12:09pm

Hi @BramVanroy, I think we actually agree. What I am saying, in mathematical terms, is that the linear layer performs the linear operation

A * B

where A is a number of obs x 768 matrix and B is a 768 x number of classes matrix.

Thus the result of the matrix multiplication is number of obs x number of classes whose row i gives the score for each class for observation i (before actually transforming to probabilities via a softmax).

Does that make sense?

BramVanroy · September 21, 2021, 12:59pm

So the linear layer does a linear transformation on the given data by multiplying it with a learned transformation matrix (and optionally adding bias). This (transposed) transformation matrix is of the shape output_features * input_features, which in the case of the classification layer is n_classes * model_hidden_size. I do not understand why you bring the vocabulary size into play, as it is not important at this stage.
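The two views (a 768 x classes matrix versus a classes x 768 matrix) can be checked directly in PyTorch. The sketch below uses assumed sizes (5 observations, 768 features, 3 classes) and random data; it shows that `nn.Linear` stores its weight as output_features x input_features and applies it transposed.

```python
import torch
from torch import nn

torch.manual_seed(0)
n_obs, hidden_size, n_classes = 5, 768, 3  # illustrative sizes

layer = nn.Linear(hidden_size, n_classes)
print(layer.weight.shape)  # torch.Size([3, 768])
print(layer.bias.shape)    # torch.Size([3])

# A is the (n_obs x 768) matrix of stacked sentence vectors.
A = torch.randn(n_obs, hidden_size)

# Multiplying by the transposed weight (768 x 3) and adding the bias
# gives the same result as calling the layer.
manual = A @ layer.weight.T + layer.bias
same = torch.allclose(layer(A), manual, atol=1e-5)
print(same)
```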

olaffson · September 21, 2021, 1:59pm

Thanks @BramVanroy for clarifying your point. You are right: I mentioned the vocabulary size earlier, which was incorrect. What I mean is the following (and again I think we agree).

The output of the language model is a vector representation of the [CLS] token for each sentence. This representation is a row vector with 768 dimensions. So if you have N documents in your data, you can stack all your row vectors into a big matrix A with dimensions N x 768.

What I think the linear layer then does is essentially learn another matrix B, whose dimensions are 768 x k, where k is the number of classes. When you multiply A by B, you get as output a matrix with N rows (your N input sentences) and k columns (the score for each possible class). I believe this is also what you are saying (except that your matrices are transposed). Does that make sense?

Thanks! It is always interesting to discuss the underlying mechanisms. Helps understand the models better.

BramVanroy · September 21, 2021, 3:38pm

Yes, this is almost correct. For completeness’ sake: the linear layer does not output “scores for the classes”. It produces logits. These can be transformed into probabilities with a softmax, as was mentioned earlier.
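For reference, a small sketch of the logits-to-probabilities step, using made-up logits for two examples and three classes:

```python
import torch

# Made-up logits: 2 examples, 3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [-0.3, 1.2, 0.1]])

probs = torch.softmax(logits, dim=-1)  # normalise each row into probabilities
predicted = probs.argmax(dim=-1)       # index of the most probable class per row

print(probs.sum(dim=-1))  # each row sums to 1 (up to floating point error)
print(predicted)
```

Because softmax preserves the ordering within a row, taking the argmax of the logits gives the same predicted class as taking the argmax of the probabilities.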

It always helps to look directly at some code, I think. At least that’s how I get to better understand these things.

olaffson · September 21, 2021, 7:21pm

perfect. I am sure this conversation will be helpful to others, too. Thanks!

theothertom · September 26, 2022, 6:18pm

Hugging Face gives you a pre-trained model with a fixed architecture; the weights are downloaded and loaded into the model when you initialize it. If you load it with AutoModel.from_pretrained(checkpoint), the output is the final hidden states of the base model, so you need to add a custom head yourself. A classification head is usually built from linear layers that produce an output of shape (batch_size, n), where n is the number of classes (say pos, neu, neg, and everything in between); set the output size of the final linear layer to 3 if you want just 3 logits. Alternatively, use AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3): the num_labels argument makes the classification head output logits of shape (batch_size, 3).
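A minimal sketch of the second route, assuming the transformers and torch packages are installed. To avoid downloading a checkpoint, it builds a tiny randomly initialised BERT from a config with made-up sizes; with a real checkpoint one would call AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3) instead.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny configuration with hypothetical sizes, randomly initialised (nothing is downloaded).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
)
model = BertForSequenceClassification(config)
model.eval()  # disable dropout for a deterministic forward pass

input_ids = torch.randint(0, config.vocab_size, (2, 8))  # batch of 2, 8 token ids each
with torch.no_grad():
    logits = model(input_ids=input_ids).logits

print(model.classifier)  # the linear head: 32 input features, 3 output features
print(logits.shape)      # torch.Size([2, 3])
```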

SaraAmd · March 21, 2023, 5:50pm

I want to use BertForSequenceClassification for binary classification. If I set num_labels to 2, it throws an error: "ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 1]))". If I set it to 1, it works. Can anyone tell me why this happens?

MPA · April 14, 2023, 8:35am

use num_labels=1
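One possible explanation, offered as an assumption because the post does not show the labels: this error message is raised by BCEWithLogitsLoss, which the forward pass quoted further down in this thread selects when problem_type is unset and the labels are not integer typed. That loss expects targets with the same shape as the logits. The sketch below reproduces the same kind of error with illustrative shapes (not the poster's exact setup) and shows that integer class indices work with CrossEntropyLoss.

```python
import torch
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss

torch.manual_seed(0)
logits = torch.randn(16, 2)  # batch of 16, two output values (num_labels=2)

# Float labels of shape (16,): BCEWithLogitsLoss requires the same shape as the logits.
float_labels = torch.randint(0, 2, (16,)).float()
try:
    BCEWithLogitsLoss()(logits, float_labels)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# Integer class indices of shape (16,) are what CrossEntropyLoss expects.
long_labels = float_labels.long()
loss = CrossEntropyLoss()(logits, long_labels)
print(loss.item())
```

Note that, according to the same forward pass, num_labels=1 selects the regression branch (MSELoss), so it runs but is not trained with a classification loss. Keeping num_labels=2 and passing labels as torch.long class indices is the other option to consider.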

Totototo · November 4, 2024, 2:58pm

A slight add-on to this conversation: it is interesting to look at the forward pass for this model, and at how the pooled output is taken out of BERT.

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = outputs[1]

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)
        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

Note how the pooling is done simply by taking the second element ([1]) of the tuple output from BERT. The first element of the tuple contains the hidden states for all tokens, while the second element is the pooled output.

Now, how is that pooled output made? You have to check the BertPooler class for that, below.

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

Thus the pooled output is a feedforward pass (hidden_size to hidden_size) with a tanh activation on top of the final hidden state of the [CLS] token.

Thus, if I’m not mistaken, for sequence classification we have:

1. a feedforward pass (hidden_size to hidden_size) with tanh activation, on top of the [CLS] token's hidden state,
2. dropout,
3. a classifier (hidden_size to num_classes).

This is slightly different from what was said above.
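Put together, the three steps can be sketched as one small module. This is an illustrative reconstruction under assumptions, not the library's code: the class name PooledClassificationHead is hypothetical, and the random tensor stands in for BERT's last hidden states.

```python
import torch
from torch import nn


class PooledClassificationHead(nn.Module):  # hypothetical name, not part of transformers
    def __init__(self, hidden_size: int, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)        # pooler projection
        self.activation = nn.Tanh()                             # pooler activation
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)   # final linear classifier

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        first_token = hidden_states[:, 0]                   # hidden state of the first ([CLS]) token
        pooled = self.activation(self.dense(first_token))   # step 1: dense + tanh
        return self.classifier(self.dropout(pooled))        # steps 2 and 3: dropout, classifier


torch.manual_seed(0)
head = PooledClassificationHead(hidden_size=768, num_classes=3).eval()
hidden_states = torch.randn(4, 16, 768)  # (batch, sequence length, hidden size)
logits = head(hidden_states)
print(logits.shape)  # torch.Size([4, 3])
```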
