Is the DistilRoBERTa sequence classification architecture essentially a sequential neural network?

I am a researcher in long-term care in the UK. I have trained a text classification model to identify loneliness in older people using the awesome transformers library. The model outperforms all other models. My concern is that it was so easy to train that I do not understand what is happening under the hood.

The relevant part of my Python code (based on this Medium post) is:

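from transformers import (AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
# The end of this call was cut off in the post; truncation=True is an assumed completion.
tokenize_func = lambda sentences: tokenizer(sentences['sentence_text'],
                                            padding="max_length",
                                            truncation=True)
# The dataset objects were lost from the post; train_ds and test_ds are assumed names.
tok_train_ds = train_ds.map(tokenize_func, batched=True)
tok_test_ds = test_ds.map(tokenize_func, batched=True)
model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", num_labels=2)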

I then instantiate a trainer object of the Trainer class and call trainer.train() for 5 epochs.

According to my config.json, the architecture is RobertaForSequenceClassification.
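
One way to see which classification head this architecture attaches, without downloading any weights, is to build a tiny randomly initialised model from a config and inspect it. This is only a sketch: the config sizes below are arbitrary assumptions chosen to keep the model small, not the values used by distilroberta-base.

```python
from transformers import RobertaConfig, RobertaForSequenceClassification

# Tiny, arbitrary sizes (assumptions for illustration only). The weights are
# randomly initialised, so nothing is downloaded.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,
)
model = RobertaForSequenceClassification(config)

# The transformer encoder lives in model.roberta and the
# classification head in model.classifier.
head_name = type(model.classifier).__name__
print(head_name)
print(model.classifier)  # lists the layers inside the head
```

The same inspection (`print(model.classifier)`) should work on the fine-tuned model loaded with `from_pretrained`.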

This is a class which calls RobertaClassificationHead, which extends nn.Module in the normal torch way, and defines the __init__ as:

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

The forward part of the class is:

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
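
To make the shapes in this forward pass concrete, here is a minimal stand-alone sketch that mirrors the logic above and runs it on random stand-in features. The class name `ClassificationHeadSketch`, the batch and sequence sizes, and the dropout rate are hypothetical; the hidden size of 768 is assumed to match distilroberta-base.

```python
import torch
from torch import nn


class ClassificationHeadSketch(nn.Module):
    """Stand-alone copy of the head's logic; name and sizes are hypothetical."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = features[:, 0, :]  # keep only the first (<s>) position: (batch, hidden)
        x = self.dropout(x)
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)
        return self.out_proj(x)  # (batch, num_labels)


torch.manual_seed(0)
batch_size, seq_len, hidden_size, num_labels = 4, 16, 768, 2

# Random stand-in for the encoder output: one vector per token position.
features = torch.randn(batch_size, seq_len, hidden_size)

head = ClassificationHeadSketch(hidden_size, num_labels).eval()  # eval() disables dropout
with torch.no_grad():
    logits = head(features)

    # Zeroing every position except the first leaves the logits unchanged,
    # because forward() only reads position 0.
    altered = features.clone()
    altered[:, 1:, :] = 0.0
    same = torch.allclose(logits, head(altered))

print(features.shape, "->", logits.shape)  # torch.Size([4, 16, 768]) -> torch.Size([4, 2])
print(same)  # True
```

In the real model the vector at position 0 is produced by the encoder's attention layers, so it already depends on every token in the sentence; the random features here carry no such information.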

I understand that the model which converts the input tokens to word embeddings is a complicated transformers model.

But am I right in interpreting the code I’ve posted to mean that, once the words are vectorized, the architecture is simply (excluding dropout layers):

sequence of word vectors -> dense layer -> tanh activation function -> 
                  hidden dense layer -> tanh activation function -> output layer

i.e. the actual classification is a straightforward multilayer perceptron (albeit with a high-dimensional input)?

Any advice would be appreciated. I am not that familiar with PyTorch, and 🤗 is so easy to use that I may not always understand what I am doing!