What is the classification head doing exactly?


When using a transformer model for text classification, one usually loads a pretrained model with AutoModelForSequenceClassification and then trains the classifier over the N classes in the data.

  • My question: which model is actually used for classification? Is it a logistic model (which uses the [CLS] representation as input)?

  • In the case of several classes (say bad, neutral, good), the usual methodology in machine learning is to train several one-vs-all classifiers and then predict the label with the most votes. Is this what is happening under the hood with Hugging Face?


@nielsr I would be curious to have your take on this, if you have a few moments. Your comments have been incredibly useful so far. Thanks!

The exact implementation of XXXForSequenceClassification differs from model to model. But you can have a look at e.g. the BERT implementation:

In this case it is simply a dropout and then a linear layer on top of the pooled outputs.
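As a minimal sketch of that head (the hidden size of 768 and three labels are just example values), it is nothing more than these two layers applied to the pooled output:

```python
import torch
from torch import nn

# Sketch of the BERT-style classification head: dropout followed by a
# single linear layer on top of the pooled output.
hidden_size, num_labels = 768, 3          # illustrative values

dropout = nn.Dropout(p=0.1)
classifier = nn.Linear(hidden_size, num_labels)

pooled_output = torch.randn(4, hidden_size)  # a batch of 4 pooled vectors
logits = classifier(dropout(pooled_output))
print(logits.shape)  # torch.Size([4, 3]) -> one value per class, per example
```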


Oh, thanks @BramVanroy! I am not a hardcore PyTorch user, but I am looking at the code. You say that this is a linear layer on the pooled output. But can the model then predict multiple classes?

What do you mean? Why wouldn’t it? A linear layer is simply a projection from X dimensions to Y dimensions, e.g. 512 to 512, or 768 to 1, or any other combination you can think of. As you can see here:

The linear layer will output the number of classes that you request. If you have five classes, it will output five values. How you deal with those values (single or multi label classification) then depends on your loss function. See:
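To illustrate (the sizes here are made up): the same five-output linear layer can serve single-label classification with a cross-entropy loss, or multi-label classification with a binary cross-entropy loss.

```python
import torch
from torch import nn

layer = nn.Linear(768, 5)      # five classes -> five output values per example
x = torch.randn(2, 768)        # two example inputs
out = layer(x)
print(out.shape)               # torch.Size([2, 5])

# Single-label: CrossEntropyLoss picks one class per example.
ce = nn.CrossEntropyLoss()
labels = torch.tensor([1, 4])
loss_single = ce(out, labels)

# Multi-label: BCEWithLogitsLoss treats each of the 5 outputs independently.
bce = nn.BCEWithLogitsLoss()
multi_labels = torch.zeros(2, 5)
multi_labels[0, 1] = 1.0       # example 0 has class 1 active
loss_multi = bce(out, multi_labels)
```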


One typically just places a linear layer (nn.Linear) on top of the final hidden state of the [CLS] token. In other words, a linear classifier is sufficient. It converts the final hidden state vector into a vector that represents the classes.

For HuggingFace models, one typically only learns a single classifier on top of the base model, for binary, multi-class and multi-label classification.


Very interesting @nielsr @BramVanroy. Thanks. It’s great to know that Hugging Face uses a consistent linear classification head in all cases.

If my understanding is correct, then the linear classification head actually learns a big matrix whose dimensions are rows * columns = vocabulary size * classes.

Is that right? Thanks!

No, that is not right. As I said before, its input dimensions are usually the output dimensions of the base model (e.g. 768) and the output dimensions are the number of classes, e.g. 3 for “neutral, positive, negative”. The classification weights are, relatively speaking, quite small in many downstream tasks.

During language modeling, the LM head has the same input dimensions, but the output dimensions are the same size as the vocabulary: it gives a score for each token in the vocabulary, indicating how well that token fits a given position. This does lead to a large classifier.
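A quick comparison of parameter counts makes the size difference concrete (30522 is BERT's vocabulary size; the class count of 3 is illustrative):

```python
from torch import nn

hidden, vocab_size, n_classes = 768, 30522, 3

lm_head = nn.Linear(hidden, vocab_size)    # LM head: hidden -> vocabulary
cls_head = nn.Linear(hidden, n_classes)    # classification head: hidden -> classes

lm_params = sum(p.numel() for p in lm_head.parameters())
cls_params = sum(p.numel() for p in cls_head.parameters())
print(lm_params, cls_params)  # the LM head is orders of magnitude larger
```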

Hi @BramVanroy, I think we actually agree. What I am saying, in mathematical terms, is that the linear layer performs the linear operation

A * B

where A is a (number of observations × 768) matrix and B is a (768 × number of classes) matrix.

Thus the result of the matrix multiplication is a (number of observations × number of classes) matrix whose row i gives the score of each class for observation i (before transforming to probabilities via a softmax).

Does that make sense?
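The shapes described above can be checked directly (the numbers of observations and classes here are arbitrary):

```python
import torch

N, hidden, k = 10, 768, 3      # observations, hidden size, number of classes
A = torch.randn(N, hidden)     # stacked representations, one row per observation
B = torch.randn(hidden, k)     # the linear layer's learned weight matrix

scores = A @ B                 # (N, hidden) @ (hidden, k) -> (N, k)
print(scores.shape)            # torch.Size([10, 3]): one score per class, per row
```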

So the linear layer does a linear transformation on the given data by multiplying it with a learned transformation matrix (and optionally adding bias). This (transposed) transformation matrix is of the shape output_features * input_features, which in the case of the classification layer is n_classes * model_hidden_size. I do not understand why you bring the vocabulary size into play, as it is not important at this stage.


Thanks @BramVanroy for clarifying your point. You are right, I mentioned the vocabulary size early on, which was incorrect. What I mean is the following (and again, I think we agree).

The output of the language model is a vector representation of the [CLS] token for each sentence. This representation is a row vector with 768 dimensions. So if you have N documents in your data, you can stack all the row vectors into a big matrix A with dimensions N × 768.

Then what I think the linear layer is doing is essentially learning another matrix B, whose dimensions are 768 × k (the number of classes), because when you do the matrix multiplication of A times B, you get as output a matrix with N rows (your N input sentences) and k columns (the score for each possible class). I believe this is also what you are saying (except that your matrices are transposed). Does that make sense?

Thanks! It is always interesting to discuss the underlying mechanisms. Helps understand the models better.

Yes, this is almost correct. For completeness’ sake: the linear layer does not output “scores for the classes”. It produces logits. These can be transformed into probabilities with a softmax, as was mentioned earlier.
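For example (the logit values here are made up), softmax turns each row of logits into a probability distribution over the classes:

```python
import torch

# Raw logits for one example over three classes (unnormalized, can be negative).
logits = torch.tensor([[2.0, -1.0, 0.5]])

# Softmax over the class dimension yields probabilities that sum to 1.
probs = torch.softmax(logits, dim=-1)
print(probs)                 # each row is a probability distribution
print(probs.sum(dim=-1))     # sums to 1
```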

It always helps to look directly at some code, I think. At least that’s how I get to better understand these things.


Perfect. I am sure this conversation will be helpful to others, too. Thanks!
