Create a custom model that works with any pretrained transformer body

I would like to create a custom model (in this case for text classification) that works on top of an arbitrary pre-trained transformer model body. More specifically, I want to use some transformer model (together with its tokenizer) to get an embedding for the given text and then build whatever I like on top of this embedding. The code below was inspired by the DistilBertForSequenceClassification model and works for the checkpoint "distilbert-base-uncased", but already fails for "bert-base-uncased", since there the embedding dimensionality is stored in config.hidden_size instead of config.dim:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


class TransformerSequenceClassifier(torch.nn.Module):

    def __init__(self, num_labels, pretrained_name, dropout=0.1):
        super().__init__()
        self.num_labels = num_labels
        # load pre-trained transformer
        self.transformer = AutoModel.from_pretrained(pretrained_name)
        # initialize other layers (head after the transformer body)
        self.pre_classifier = torch.nn.Linear(self.transformer.config.dim, self.transformer.config.dim)
        self.classifier = torch.nn.Linear(self.transformer.config.dim, num_labels)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, input_ids=None, **kwargs):
        # get text representation from transformer
        transformer_output = self.transformer(
            input_ids=input_ids,
            **kwargs,
        )
        hidden_state = transformer_output[0]                # (bs, seq_len, dim)
        pooled_output = hidden_state[:, 0]                  # (bs, dim)
        # apply classification layers
        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)
        pooled_output = F.relu(pooled_output)               # (bs, dim)
        pooled_output = self.dropout(pooled_output)         # (bs, dim)
        output = self.classifier(pooled_output)             # (bs, num_labels)
        return output


if __name__ == '__main__':
    # initialize model and corresponding tokenizer
    pretrained_name = "distilbert-base-uncased"
    model = TransformerSequenceClassifier(2, pretrained_name)
    tokenizer = AutoTokenizer.from_pretrained(pretrained_name)
    # apply model to some example sentences
    batch = tokenizer(
        ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
        padding=True,
        truncation=True,
        return_tensors="pt"
    )
    y_pred = model(**batch)

For my code to work I would need a model-agnostic way to:

  1. prune task specific heads from the model (if it has any)
  2. compute a single embedding vector for a sequence of input tokens (for BERT-based models afaik this is the representation for the [CLS] token at the beginning of the sequence, but I’m not sure about the rest of the model zoo)
  3. know in advance what the dimensionality of this embedding will be

Are there any suggestions on how to accomplish the above steps? I'd also be happy with a solution that works only for BERT-based models, but I can't believe that I already failed on point 3…


Note that hidden_size is a property on all configurations, so if you use it instead of dim in your example, you should be good.
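As a minimal sketch, the only change needed in the constructor from your example would be to read the model-agnostic attribute instead of dim:

    # in TransformerSequenceClassifier.__init__
    hidden_size = self.transformer.config.hidden_size  # exists on all configs (DistilBertConfig maps it to dim)
    self.pre_classifier = torch.nn.Linear(hidden_size, hidden_size)
    self.classifier = torch.nn.Linear(hidden_size, num_labels)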

As for pruning task-specific heads, I don’t think you will have any when using the AutoModel architecture, since it’s supposed to be the bare model.

So the main issue will be getting the representation. For this, I’m afraid there is nothing generic that will work across the model zoo (which is why the XxxForSequenceClassification models are implemented in separate files), since some models expect to use the first token, others the last, others the mean, etc.
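One pragmatic workaround (just a sketch, not an official API; pool_hidden_state is a hypothetical helper) is to make the pooling strategy an explicit argument, so the caller picks whatever is appropriate for the chosen checkpoint:

    import torch

    def pool_hidden_state(hidden_state, attention_mask, strategy="cls"):
        """Reduce (bs, seq_len, hidden_size) to (bs, hidden_size).

        BERT-style models are usually pooled on the first ([CLS]) token,
        while other checkpoints may work better with mean pooling over
        the non-padding tokens.
        """
        if strategy == "cls":
            return hidden_state[:, 0]
        if strategy == "mean":
            mask = attention_mask.unsqueeze(-1).type_as(hidden_state)  # (bs, seq_len, 1)
            return (hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        raise ValueError(f"Unknown pooling strategy: {strategy}")

You would still have to decide per checkpoint which strategy to pass, but at least the rest of the head stays model-agnostic.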
