How do I put a classification head on top of a GPT2 model?

Hi everyone.

I’m trying to use the GPT2 model for multi-label text classification, with a maximum of 256 tokens. I don’t want to use GPT2ForSequenceClassification directly, because I need a dropout layer between the GPT2 model and the classifier head.

Inspired by the source code of GPT2ForSequenceClassification, I tried defining my model as follows:

import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModel, AutoTokenizer


MODEL_NAME = "gpt2"

# GPT2 has no padding token by default, so reuse the EOS token for padding
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

gpt2_config = AutoConfig.from_pretrained(MODEL_NAME)
gpt2_model = AutoModel.from_pretrained(pretrained_model_name_or_path=MODEL_NAME, config=gpt2_config)

gpt2_model.resize_token_embeddings(len(tokenizer))
gpt2_model.config.pad_token_id = gpt2_model.config.eos_token_id

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = gpt2_model
        self.l2 = nn.Dropout(0.3)
        self.l3 = nn.Linear(self.l1.config.hidden_size, n_classes)
    
    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.l1(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = self.l2(out[0])  # out[0] is last_hidden_state: [BATCH_SIZE, seq_len, hidden_size]
        out = self.l3(out)
        return out

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Model()
model.to(device)

However, the output shape of the GPT2 model is [BATCH_SIZE, 256, self.l1.config.hidden_size], where self.l1.config.hidden_size = 768, so after the dropout and the classifier head the resulting shape is [BATCH_SIZE, 256, n_classes]. This is a problem, since my ground truth has the shape [BATCH_SIZE, n_classes].
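
To make the mismatch concrete, here is a quick shape check on a dummy batch (the two input strings are made up, purely for illustration):

# Dummy batch, padded/truncated to 256 tokens as in my setup
batch = tokenizer(
    ["first example text", "second example text"],
    padding="max_length",
    truncation=True,
    max_length=256,
    return_tensors="pt",
).to(device)

with torch.no_grad():
    hidden = gpt2_model(batch["input_ids"], attention_mask=batch["attention_mask"])[0]
    logits = model(batch["input_ids"], batch["attention_mask"], None)

print(hidden.shape)  # torch.Size([2, 256, 768])
print(logits.shape)  # torch.Size([2, 256, n_classes]), but my targets are [2, n_classes]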

I read online that it seems to be enough to use the output of the last token from the GPT2 model, so I revised my model accordingly. Note the change from

out = self.l2(out[0])

to

out = self.l2(out[0][:, -1, :])

The revised model:

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = gpt2_model
        self.l2 = nn.Dropout(0.3)
        self.l3 = nn.Linear(self.l1.config.hidden_size, len(features.mlb.classes_))  # len(features.mlb.classes_) == n_classes

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.l1(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = self.l2(out[0][:, -1, :])  # keep only the last position's hidden state
        out = self.l3(out)
        return out

The output shape of the above model is [BATCH_SIZE, n_classes], and the model is training now (I'm still waiting to benchmark it on the test set).
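
In case it matters, the training step is roughly the sketch below (not my exact script; the texts and labels here are placeholders, with the labels being multi-hot vectors of shape [BATCH_SIZE, n_classes] since this is multi-label):

# Rough sketch of one training step; in the real script the batch and labels
# come from a DataLoader. Multi-label, so BCEWithLogitsLoss over multi-hot targets.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(
    ["some document", "another document"],
    padding="max_length", truncation=True, max_length=256, return_tensors="pt",
).to(device)
labels = torch.randint(0, 2, (2, n_classes), device=device).float()  # [BATCH_SIZE, n_classes]

model.train()
logits = model(batch["input_ids"], batch["attention_mask"], None)  # [BATCH_SIZE, n_classes]
loss = criterion(logits, labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()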

I wanted to ask: is my revised model correct? In particular, does it make sense to use out = self.l2(out[0][:, -1, :])?

If it is correct, then why does the source code of GPT2ForSequenceClassification do something like out = self.l2(out[0])? Why does it work there, but fail in my case?
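
For reference, my reading of that source (paraphrased, not the exact code; I reuse my own l3 as a stand-in for the library's score layer, and batch is a tokenized batch like the one above) is roughly:

# Paraphrase of what I understand GPT2ForSequenceClassification to do:
# the linear head is applied to EVERY position, and then the logits of the
# last non-padding token are picked out for each example.
with torch.no_grad():
    hidden_states = gpt2_model(batch["input_ids"], attention_mask=batch["attention_mask"])[0]
    all_logits = model.l3(hidden_states)  # [BATCH_SIZE, 256, n_classes]

# index of the last non-pad token in each sequence
sequence_lengths = torch.ne(batch["input_ids"], gpt2_model.config.pad_token_id).sum(-1) - 1
pooled_logits = all_logits[torch.arange(all_logits.size(0), device=all_logits.device), sequence_lengths]
# pooled_logits: [BATCH_SIZE, n_classes]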