A common modeling head for models whose main output is an embedding vector is a dense layer that projects the output dimensionality of the base transformer model to some other dimensionality. In particular, this is a common setup when doing knowledge distillation: a student model is trained to emulate a teacher model, but the student model's last hidden state dimension is not the same as the teacher model's. What to do? Add a dense linear transformation.
Another common scenario for a linear transformation head is found in multimodal models such as CLIP, where the text and image models are both required to output an embedding of the same dimensionality. You have solved this in transformers by implementing a specific CLIPModel class and adding the linear layers as part of that model implementation.
I would, however, argue that projection heads are common and generic enough to warrant their own modeling head implementation for most models, i.e. an XForLinearTransformation or XForProjection modeling head for almost all models in the transformers library.
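Purely to illustrate what I mean, a minimal sketch of such a head could look like the following (BertForProjection, the projection_dim argument, and the choice of mean pooling are my own made-up examples, not existing transformers API):

import torch
from torch import nn
from transformers import BertModel, BertPreTrainedModel

class BertForProjection(BertPreTrainedModel):
    """Sketch of a generic projection head: base model + dense layer to a target dimension."""

    def __init__(self, config, projection_dim=512):
        super().__init__(config)
        self.bert = BertModel(config)
        # Dense layer mapping hidden_size -> projection_dim (e.g. the teacher model's dimension)
        self.projection = nn.Linear(config.hidden_size, projection_dim)
        self.post_init()

    def forward(self, input_ids, attention_mask=None, **kwargs):
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        last_hidden_state = self.bert(input_ids, attention_mask=attention_mask, **kwargs).last_hidden_state
        # Mean pooling over non-padding tokens (the most common choice for embedding models)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.projection(pooled)  # (batch_size, projection_dim)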
Currently, many models trained with SentenceTransformers, as well as custom multilingual CLIP models where a multilingual or non-English monolingual text model is trained to emulate the embeddings of the English teacher CLIP text model, cannot easily be loaded with transformers. They typically require an external library to make loading easy and intuitive for the user.
Models and workflows that deal directly with embeddings are becoming more and more common. The transformers library should be able to accommodate users who create their own CLIPs or sentence-transformers using any model class of their choice as the source model. It should not be necessary for Huggingface to implement a new model class every time someone wants to attach a custom projection head to their BERT, RoBERTa, T5, or whatever model.
While writing this, I realized that ModelForSequenceClassification could perhaps technically be wrangled into instantiating such a head and returning the projection as logits. However, I think this use of ModelForSequenceClassification is so unexpected and unintuitive that nobody would even think to try it. Would it be possible to use BertForSequenceClassification for this purpose?
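To make that wrangling idea concrete, I imagine something like this, where num_labels is abused as the projection dimension (bert-base-uncased is just an example checkpoint, and a freshly initialized classification head would of course still need to be trained to act as a projection):

from transformers import BertForSequenceClassification, BertTokenizer

# Abuse num_labels as the projection dimension: the classification head then becomes
# a linear layer hidden_size -> 512 on top of the pooled [CLS] output.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=512)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("An example sentence", return_tensors="pt")
projected = model(**inputs).logits  # (batch_size, 512), read as an embedding rather than class scores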
Anyway, hoping to get your opinions and thoughts on my suggestion.
I have looked at trying to adapt ModelForSequenceClassification for this task today. The main issue I am running into is that the ModelForSequenceClassification implementation is hardcoded to only send the pooled [CLS] output of BertModel to the linear classifier layer.
The majority of implementations that train models to output semantic embeddings instead do a mean pooling over all token embeddings in the last hidden state, taking the attention mask into account in the averaging. Something like:
# Mean pooling that takes the attention mask into account, so padding tokens are ignored
mask = attention_mask.unsqueeze(-1).float()
pooled_output = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
where last_hidden_state is (batch_size, sequence_length, hidden_size) and attention_mask is (batch_size, sequence_length).
If you look at the sentence-transformers models, for example, their model cards all include instructions to perform this manual mean pooling if you want to use the model with the transformers library (see under Usage: Huggingface Transformers).
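For reference, the pattern those model cards describe boils down to something like the following (sentence-transformers/all-MiniLM-L6-v2 is just one example checkpoint):

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer(["This is a sentence."], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    last_hidden_state = model(**inputs).last_hidden_state  # (batch_size, sequence_length, hidden_size)

# Manual mean pooling with the attention mask, exactly as above
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)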
I think, however, that the majority of users find working with hidden_states outputs difficult. And this isn't helped by the fact that different model classes have different default outputs in different orders, with some of them (like ModelForSequenceClassification) not being able to output things like last_hidden_state. The inconvenience of all of this is likely what pushes most contributors of models whose main output is semantic embeddings to create their own libraries.
What about making ModelForSequenceClassification's choice of pooling strategy more flexible? Would it be possible to add an argument that lets users choose the pooling strategy, e.g. pooling_strategy (str): "CLS" | "mean" | "max" | "concat"? I personally only really care about mean pooling, as that is by far the most commonly used one.
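As a sketch of what such an argument could dispatch on (pooling_strategy is not an existing argument, this is only an illustration of the proposal, with "concat" left out for brevity):

def pool(last_hidden_state, attention_mask, pooling_strategy="mean"):
    # Hypothetical dispatch for a pooling_strategy argument
    mask = attention_mask.unsqueeze(-1).float()
    if pooling_strategy == "CLS":
        return last_hidden_state[:, 0]  # embedding of the first ([CLS]) token
    if pooling_strategy == "mean":
        return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    if pooling_strategy == "max":
        return last_hidden_state.masked_fill(mask == 0, float("-inf")).max(dim=1).values
    raise ValueError(f"Unknown pooling_strategy: {pooling_strategy}")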
Could I start a feature request issue regarding this, or are the chances for a change low?
We won't modify ModelForSequenceClassification like this, as Transformers is very much not a modular toolbox. To get the last linear layer, the best is to use the base model and take the last_hidden_state attribute of its output.
Alright, thanks for the answer. I learned some new things about the library from this discussion!
Extracting last_hidden_state from the base model is OK. I was just hoping there was a way to bend ModelForSequenceClassification so that it would also work with models of the type
Base Model + custom pooling output from last_hidden_state + linear transformation head
This way users could both i) load the above-mentioned type of models through transformers instead of through external libraries, and ii) extract embeddings in a one-liner inference step via transformers instead of having to load an external library for that one-liner.