Hi everyone.
I’m trying to use the GPT2 model for multi-label text classification, with a maximum length of 256 tokens. I don’t want to use GPT2ForSequenceClassification directly, because I need a dropout layer between the GPT2 model and the classifier head.
Inspired by the source code of GPT2ForSequenceClassification, I tried defining my model as follows:
import torch.nn as nn
from transformers import AutoConfig, AutoModel

MODEL_NAME = "gpt2"

# tokenizer, n_classes and device are defined earlier (not shown here)
gpt2_config = AutoConfig.from_pretrained(MODEL_NAME)
gpt2_model = AutoModel.from_pretrained(pretrained_model_name_or_path=MODEL_NAME, config=gpt2_config)
gpt2_model.resize_token_embeddings(len(tokenizer))
gpt2_model.config.pad_token_id = gpt2_model.config.eos_token_id

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = gpt2_model
        self.l2 = nn.Dropout(0.3)
        self.l3 = nn.Linear(self.l1.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.l1(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # out[0] is the last hidden state for every token position
        out = self.l2(out[0])
        out = self.l3(out)
        return out

model = Model()
model.to(device)
However, the output shape of the GPT2 model is [BATCH_SIZE, 256, self.l1.config.hidden_size], where self.l1.config.hidden_size = 768, and after this goes through the dropout and the classifier head, the resulting shape is [BATCH_SIZE, 256, n_classes]. This is a problem because my ground truth has the shape [BATCH_SIZE, n_classes].
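To make the shape issue concrete, here is a minimal sketch that uses a random tensor as a stand-in for the GPT2 output, so no model download is needed. The sizes and variable names are illustrative assumptions, not values from my actual setup.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only
batch_size, seq_len, hidden_size, n_classes = 4, 256, 768, 5

# Stand-in for out[0], the last hidden state returned by the GPT2 model
hidden_states = torch.randn(batch_size, seq_len, hidden_size)

dropout = nn.Dropout(0.3)
classifier = nn.Linear(hidden_size, n_classes)

# nn.Linear only transforms the last dimension, so the sequence axis survives
per_token_logits = classifier(dropout(hidden_states))
print(per_token_logits.shape)  # torch.Size([4, 256, 5])
```

The head produces one set of logits per token, which is why the sequence dimension has to be reduced somehow before computing the loss.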
I read online that it seems to be enough to use the output of the last token from the GPT2 model, so I revised my model as shown below. Note the change from
out = self.l2(out[0])
to
out = self.l2(out[0][:, -1, :])
in the forward method:
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = gpt2_model
        self.l2 = nn.Dropout(0.3)
        self.l3 = nn.Linear(self.l1.config.hidden_size, len(features.mlb.classes_))

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.l1(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # Keep only the hidden state at the last sequence position
        out = self.l2(out[0][:, -1, :])
        out = self.l3(out)
        return out
The output shape of the above model is [BATCH_SIZE, n_classes], and this model is training now (I’m still waiting to benchmark it on the test set).
I wanted to ask: is my revised model correct? In particular, does it make sense to use out = self.l2(out[0][:, -1, :])?
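One caveat worth checking: if the batch is right-padded to 256 tokens, position -1 is a padding token for every sequence shorter than 256, so [:, -1, :] would pool a padding position. With left padding, position -1 is already the last real token. Below is a minimal sketch of picking the last non-padding token from the attention mask. It assumes right padding and a mask with 1 for real tokens and 0 for padding, and the helper name last_token_hidden is hypothetical.

```python
import torch

def last_token_hidden(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token of each sequence.

    Assumes right padding and an attention_mask of 1s (real) and 0s (padding).
    """
    # Index of the last real token in each sequence, shape [batch]
    last_idx = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]  # shape [batch, hidden_size]

# Toy example: batch of 2, sequence length 4, hidden size 3
hidden_states = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])
pooled = last_token_hidden(hidden_states, attention_mask)
print(pooled.shape)  # torch.Size([2, 3])
```

In the forward method, the result of this helper could be passed to the dropout layer in place of out[0][:, -1, :].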
If it is correct, then why does the source code of GPT2ForSequenceClassification pass the full hidden states to its head (something like out = self.l2(out[0]))? Why does that work in the source code but fail in my case?
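As far as I can tell from the transformers source, GPT2ForSequenceClassification does apply its linear head to every position, but it then keeps only the logits at the last non-padding token of each sequence, so its final output is still [BATCH_SIZE, num_labels]. The sketch below uses toy sizes, a random stand-in tensor, and made-up positions (all assumptions) to show that scoring first and selecting second gives the same result as selecting first and scoring second.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, hidden_size, n_classes = 2, 6, 8, 3  # toy sizes
hidden_states = torch.randn(batch_size, seq_len, hidden_size)
last_idx = torch.tensor([5, 2])  # hypothetical last non-padding positions
batch_idx = torch.arange(batch_size)

# Linear head; bias omitted here for simplicity
score = nn.Linear(hidden_size, n_classes, bias=False)

# Order A: score every position, then keep one position per sequence
logits_a = score(hidden_states)[batch_idx, last_idx]
# Order B: keep one position per sequence, then score it
logits_b = score(hidden_states[batch_idx, last_idx])

print(logits_a.shape)  # torch.Size([2, 3])
print(torch.allclose(logits_a, logits_b, atol=1e-5))
```

The two orders agree because a linear layer treats each position independently. Dropout in training mode draws a random mask, so the values will differ from run to run, but the output shape is the same either way.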