I would like to create a custom model (in this case for text classification) that works on top of an arbitrary pre-trained transformer model body. More specifically, I want to use some transformer model (together with its tokenizer) to get an embedding for the given text and then do whatever on top of this embedding. The code below was inspired by the DistilBertForSequenceClassification model and works for the checkpoint "distilbert-base-uncased", but it already fails for "bert-base-uncased", since there the embedding dimensionality is stored in config.hidden_size instead of config.dim:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

class TransformerSequenceClassifier(torch.nn.Module):
    def __init__(self, num_labels, pretrained_name, dropout=0.1):
        super().__init__()
        self.num_labels = num_labels
        # load pre-trained transformer
        self.transformer = AutoModel.from_pretrained(pretrained_name)
        # initialize other layers (head after the transformer body)
        self.pre_classifier = torch.nn.Linear(self.transformer.config.dim, self.transformer.config.dim)
        self.classifier = torch.nn.Linear(self.transformer.config.dim, num_labels)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, input_ids=None, **kwargs):
        # get text representation from transformer
        transformer_output = self.transformer(
            input_ids=input_ids,
            **kwargs,
        )
        hidden_state = transformer_output[0]  # (bs, seq_len, dim)
        pooled_output = hidden_state[:, 0]    # (bs, dim)
        # apply classification layers
        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)
        pooled_output = F.relu(pooled_output)               # (bs, dim)
        pooled_output = self.dropout(pooled_output)         # (bs, dim)
        output = self.classifier(pooled_output)             # (bs, num_labels)
        return output

if __name__ == '__main__':
    # initialize model and corresponding tokenizer
    pretrained_name = "distilbert-base-uncased"
    model = TransformerSequenceClassifier(2, pretrained_name)
    tokenizer = AutoTokenizer.from_pretrained(pretrained_name)
    # apply model to some example sentences
    batch = tokenizer(
        ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
        padding=True,
        truncation=True,
        return_tensors="pt"
    )
    y_pred = model(**batch)
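To illustrate where it breaks: the two checkpoints store the hidden dimensionality under different config attributes (at least in the transformers version I'm using), so the Linear layers above can't be constructed for BERT:

from transformers import AutoConfig

distilbert_config = AutoConfig.from_pretrained("distilbert-base-uncased")
bert_config = AutoConfig.from_pretrained("bert-base-uncased")

print(distilbert_config.dim)        # 768, the attribute my code relies on
print(bert_config.hidden_size)      # 768, but stored under a different name
print(hasattr(bert_config, "dim"))  # False, so self.pre_classifier above raises AttributeError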
For my code to work I would need a model-agnostic way to:
- prune task-specific heads from the model (if it has any)
- compute a single embedding vector for a sequence of input tokens (for BERT-based models, afaik, this is the representation of the [CLS] token at the beginning of the sequence, but I'm not sure about the rest of the model zoo)
- know in advance what the dimensionality of this embedding will be

Are there any suggestions on how to accomplish the above steps? I'm also happy about a solution that works only for BERT-based models, but I can't believe that I already failed on point 3…
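To make points 2 and 3 more concrete, here is a rough sketch of the kind of model-agnostic helper I have in mind. The getattr fallback between hidden_size and dim and the masked mean pooling (instead of taking the [CLS] position) are just my own guesses, I haven't verified that they generalize across the model zoo:

import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

def get_embedding_dim(config):
    # guess: most configs expose hidden_size, DistilBERT stores it in dim
    return getattr(config, "hidden_size", None) or getattr(config, "dim")

def embed(texts, pretrained_name):
    tokenizer = AutoTokenizer.from_pretrained(pretrained_name)
    model = AutoModel.from_pretrained(pretrained_name)  # body only, no task head
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (bs, seq_len, dim)
    # mean pooling over non-padding tokens instead of relying on a [CLS] token
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (bs, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (bs, dim)

dim = get_embedding_dim(AutoConfig.from_pretrained("bert-base-uncased"))
vectors = embed(["An example sentence."], "bert-base-uncased")
assert vectors.shape[-1] == dim

Is something along these lines reasonable, or is there a cleaner way that the library already supports?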