I am a researcher in long-term care in the UK. I have trained a text classification model to identify loneliness in older people using the awesome transformers library. The model outperforms every other model I have tried on this task. My concern is that it was so easy to train that I do not understand what is happening under the hood.
The relevant part of my Python code (based on this Medium post) is:
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments)
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
tokenize_func = lambda sentences: tokenizer(
    sentences['sentence_text'], padding="max_length", truncation=True
)
tok_train_ds = train_ds.map(tokenize_func, batched=True)
tok_test_ds = test_ds.map(tokenize_func, batched=True)
model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", num_labels=2)
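(For context, train_ds and test_ds above are Hugging Face datasets.Dataset objects with a sentence_text column and an integer label column. They are built roughly along these lines; the CSV file names here are just placeholders:)

from datasets import load_dataset

# Placeholder file names; each CSV has a 'sentence_text' column and an integer 'label' column
raw_ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
train_ds = raw_ds["train"]
test_ds = raw_ds["test"]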
I then instantiate a trainer object of the Trainer class and call trainer.train() for 5 epochs, roughly as sketched below.
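Continuing from the code above, the trainer setup looks roughly like this (the 5 epochs is the only setting I am reporting from my actual run; the output directory and anything else here is a placeholder):

training_args = TrainingArguments(
    output_dir="loneliness-classifier",  # placeholder output directory
    num_train_epochs=5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tok_train_ds,
    eval_dataset=tok_test_ds,
)

trainer.train()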
According to my config.json, the architecture is RobertaForSequenceClassification.
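As a sanity check, the head can also be inspected directly from the loaded model; if I am reading the source correctly, RobertaForSequenceClassification exposes it as model.classifier, so for distilroberta-base with two labels the printout should look something like this:

# Print the classification head that sits on top of the DistilRoBERTa encoder
print(model.classifier)
# Expected (roughly):
# RobertaClassificationHead(
#   (dense): Linear(in_features=768, out_features=768, bias=True)
#   (dropout): Dropout(p=0.1, inplace=False)
#   (out_proj): Linear(in_features=768, out_features=2, bias=True)
# )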
RobertaForSequenceClassification is a class which calls RobertaClassificationHead, which extends nn.Module in the normal torch way and defines __init__ as:
def __init__(self, config):
    super().__init__()
    self.dense = nn.Linear(config.hidden_size, config.hidden_size)
    classifier_dropout = (
        config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
    )
    self.dropout = nn.Dropout(classifier_dropout)
    self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
The forward part of the class is:
def forward(self, features, **kwargs):
    x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
    x = self.dropout(x)
    x = self.dense(x)
    x = torch.tanh(x)
    x = self.dropout(x)
    x = self.out_proj(x)
    return x
I understand that the model which converts the input tokens into contextual token embeddings is a complicated transformer model (DistilRoBERTa).
But am I right in interpreting the code I’ve posted to mean that, once the words are vectorized, the classification head simply takes the embedding of the <s> token and (excluding the dropout layers) passes it through:

<s> token vector -> dense layer -> tanh activation -> output projection layer

i.e. the actual classification is a straightforward multilayer perceptron (albeit with a high-dimensional input)?
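If that reading is right, then ignoring dropout the head would be equivalent to a tiny torch module like the one below, applied to the 768-dimensional <s> vector (this is just a sketch for my own understanding, not code from the library):

import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2  # distilroberta-base hidden size, two classes

# Dropout-free equivalent of RobertaClassificationHead
mlp_head = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),  # self.dense
    nn.Tanh(),                            # torch.tanh(x)
    nn.Linear(hidden_size, num_labels),   # self.out_proj
)

cls_vector = torch.randn(1, hidden_size)  # stand-in for features[:, 0, :]
logits = mlp_head(cls_vector)             # shape: (1, 2)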
Any advice would be appreciated - I am not that familiar with PyTorch, and transformers is so easy to use that I may not always understand what I am doing!