I am a researcher in long-term care in the UK. I have trained a text classification model to identify loneliness in older people using the awesome transformers library. The model outperforms all other models. My concern is that it was so easy to train that I do not understand what is happening under the hood.
The relevant part of my Python code (based on this Medium post) is:
    from transformers import (
        AutoTokenizer,
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

    tokenize_func = lambda sentences: tokenizer(
        sentences['sentence_text'],
        padding="max_length",
        truncation=True,
    )

    tok_train_ds = train_ds.map(tokenize_func, batched=True)
    tok_test_ds = test_ds.map(tokenize_func, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilroberta-base", num_labels=2
    )
I then instantiate a trainer object of the Trainer class and call trainer.train() for 5 epochs.
According to my config.json, the architecture is a class which calls RobertaClassificationHead, which extends nn.Module in the normal torch way and defines the following __init__ method:
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout
            if config.classifier_dropout is not None
            else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
The forward part of the class is:
    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
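To check my reading of forward, I wrote the same steps out as a toy, dependency-free sketch. It is not the library's code: the sizes are tiny made-up values, the weights are random stand-ins for trained ones, the helper names (linear, random_linear, head_forward) are my own, and dropout is left out on the assumption that it acts as the identity at inference time. Note that in the real model, features is the output of the transformer encoder, so as far as I understand the first token's vector may already carry information from the whole sentence.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector x (one weight row per output)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def random_linear(n_in, n_out, rng):
    """Hypothetical helper: random weights standing in for trained ones."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def head_forward(features, dense, out_proj):
    """Mirror of forward() for one sequence, with dropout omitted.

    features: list of per-token vectors, shape (seq_len, hidden_size).
    """
    x = features[0]                    # keep only the first (<s>) token vector
    x = linear(x, *dense)              # hidden_size -> hidden_size
    x = [math.tanh(v) for v in x]      # the only non-linearity in the head
    return linear(x, *out_proj)        # hidden_size -> num_labels (raw logits)


rng = random.Random(0)
hidden_size, num_labels, seq_len = 4, 2, 3   # toy sizes, not the real config
dense = random_linear(hidden_size, hidden_size, rng)
out_proj = random_linear(hidden_size, num_labels, rng)
features = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(seq_len)]

logits = head_forward(features, dense, out_proj)

# Zero out every token vector except the first: the head's output is unchanged,
# because the head itself never looks past position 0.
altered = [features[0]] + [[0.0] * hidden_size for _ in range(seq_len - 1)]
same_logits = head_forward(altered, dense, out_proj)
print(logits == same_logits)
```

If this sketch is faithful, the head only ever sees one vector per sentence rather than the full sequence of token vectors.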
I understand that the model which converts the input tokens to word embeddings is a complicated transformers model.
But am I right in interpreting the code I’ve posted to mean that, once the words are vectorized, the architecture is simply (excluding dropout layers):
sequence of word vectors -> dense layer -> tanh activation function -> hidden dense layer -> tanh activation function -> output layer
i.e. the actual classification is a straightforward multilayer perceptron (albeit with a high-dimensional input)?
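For a sense of scale, here is a rough count of the trainable parameters in the head alone. The value hidden_size = 768 is what I believe distilroberta-base uses, but treat it as an assumption, and linear_params is just a hypothetical helper of mine for counting the weights and biases of one nn.Linear layer.

```python
def linear_params(n_in, n_out):
    """Parameters in one fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


hidden_size, num_labels = 768, 2   # assumed config values for my model

dense_params = linear_params(hidden_size, hidden_size)     # self.dense
out_proj_params = linear_params(hidden_size, num_labels)   # self.out_proj
head_params = dense_params + out_proj_params

print(dense_params, out_proj_params, head_params)  # 590592 1538 592130
```

Under those assumptions the head has roughly 0.6 million parameters, with everything else living in the transformer underneath it.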
Any advice would be appreciated - I am not that familiar with PyTorch, and the library is so easy to use that I may not always understand what I am doing!