I am a researcher in long-term care in the UK. I have trained a text classification model to identify loneliness in older people using the awesome transformers library. The model outperforms every other model I have tried on this task. My concern is that it was so easy to train that I do not understand what is happening under the hood.
The relevant part of my Python code (based on this Medium post) is:
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments)
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
tokenize_func = lambda sentences: tokenizer(
    sentences['sentence_text'], padding="max_length", truncation=True
)
tok_train_ds = train_ds.map(tokenize_func, batched=True)
tok_test_ds = test_ds.map(tokenize_func, batched=True)
model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", num_labels=2)
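(For context, train_ds and test_ds above are Hugging Face datasets.Dataset objects with a sentence_text column and an integer label column. They are built roughly along these lines; the CSV file names here are just placeholders:)

from datasets import load_dataset

# Placeholder file names; each CSV has a 'sentence_text' column and an integer 'label' column
raw_ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
train_ds = raw_ds["train"]
test_ds = raw_ds["test"]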
I then instantiate a trainer object of the Trainer class and call trainer.train() for 5 epochs, roughly as sketched below.
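Continuing from the code above, the trainer setup looks roughly like this (the 5 epochs is the only setting I am reporting from my actual run; the output directory and anything else here is a placeholder):

training_args = TrainingArguments(
    output_dir="loneliness-classifier",  # placeholder output directory
    num_train_epochs=5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tok_train_ds,
    eval_dataset=tok_test_ds,
)

trainer.train()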
According to my config.json, the architecture is RobertaForSequenceClassification.
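As a sanity check, the head can also be inspected directly from the loaded model; if I am reading the source correctly, RobertaForSequenceClassification exposes it as model.classifier, so for distilroberta-base with two labels the printout should look something like this:

# Print the classification head that sits on top of the DistilRoBERTa encoder
print(model.classifier)
# Expected (roughly):
# RobertaClassificationHead(
#   (dense): Linear(in_features=768, out_features=768, bias=True)
#   (dropout): Dropout(p=0.1, inplace=False)
#   (out_proj): Linear(in_features=768, out_features=2, bias=True)
# )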
RobertaForSequenceClassification is a class which calls RobertaClassificationHead, which extends nn.Module in the normal torch way and defines __init__ as:
def __init__(self, config):
    super().__init__()
    self.dense = nn.Linear(config.hidden_size, config.hidden_size)
    classifier_dropout = (
        config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
    )
    self.dropout = nn.Dropout(classifier_dropout)
    self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
The forward part of the class is:
def forward(self, features, **kwargs):
    x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
    x = self.dropout(x)
    x = self.dense(x)
    x = torch.tanh(x)
    x = self.dropout(x)
    x = self.out_proj(x)
    return x
I understand that the model which converts the input tokens into contextual token embeddings is a complicated transformer model (DistilRoBERTa).
But am I right in interpreting the code I’ve posted to mean that, once the words are vectorized, the classification head simply takes the embedding of the <s> token and (excluding the dropout layers) passes it through:

<s> token vector -> dense layer -> tanh activation -> output projection layer

i.e. the actual classification is a straightforward multilayer perceptron (albeit with a high-dimensional input)?
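If that reading is right, then ignoring dropout the head would be equivalent to a tiny torch module like the one below, applied to the 768-dimensional <s> vector (this is just a sketch for my own understanding, not code from the library):

import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2  # distilroberta-base hidden size, two classes

# Dropout-free equivalent of RobertaClassificationHead
mlp_head = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),  # self.dense
    nn.Tanh(),                            # torch.tanh(x)
    nn.Linear(hidden_size, num_labels),   # self.out_proj
)

cls_vector = torch.randn(1, hidden_size)  # stand-in for features[:, 0, :]
logits = mlp_head(cls_vector)             # shape: (1, 2)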
Any advice would be appreciated - I am not that familiar with PyTorch, and transformers is so easy to use that I may not always understand what I am doing!