I’m interested in single-sentence and sentence-pair text classification, so I’ve been looking at the classification heads for BERT, GPT2, XLNet, and RoBERTa. I have a few questions:
1. I see that there are dedicated classification classes: BertForSequenceClassification, XLNetForSequenceClassification, and RobertaForSequenceClassification. However, there is no corresponding class for GPT2. Is there any documentation to help us write our own?
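For reference, here is the kind of thing I’ve sketched so far. This is hypothetical, not a library class — the name GPT2ForSequenceClassification is my own. Since GPT-2 has no [CLS] token, I pool the hidden state of the last token, similar to what XLNet’s SequenceSummary does with summary_type="last":

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2Model


class GPT2ForSequenceClassification(nn.Module):
    """Hypothetical sketch of a GPT-2 classification head (my own, not official)."""

    def __init__(self, config, num_labels):
        super().__init__()
        self.transformer = GPT2Model(config)
        self.dropout = nn.Dropout(config.resid_pdrop)
        self.classifier = nn.Linear(config.n_embd, num_labels)

    def forward(self, input_ids):
        # (batch, seq_len, n_embd) hidden states from the base model
        hidden_states = self.transformer(input_ids)[0]
        # GPT-2 has no [CLS] token, so pool the last position instead
        last_token = hidden_states[:, -1, :]
        return self.classifier(self.dropout(last_token))
```

With padded batches you would instead index the last non-padded token of each sequence rather than position -1, but I’ve left that out of the sketch.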
2. When I look at the classification heads for BERT, XLNet, and RoBERTa, the layer structure for producing the logits appears to be different for each one. I would think that the final few layers would be exactly the same.
For example, here is the code for the BERT classification head:
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward( ... ):
        outputs = self.bert( ... )
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
Here is the code for the XLNet classification head:
class XLNetForSequenceClassification(XLNetPreTrainedModel):
    def __init__(self, config):
        self.transformer = XLNetModel(config)
        self.sequence_summary = SequenceSummary(config)
        self.logits_proj = nn.Linear(config.d_model, config.num_labels)

    def forward( ... ):
        transformer_outputs = self.transformer( ... )
        output = transformer_outputs[0]
        output = self.sequence_summary(output)
        logits = self.logits_proj(output)
Here is the code for the RoBERTa classification head:
class RobertaForSequenceClassification(BertPreTrainedModel):
    config_class = RobertaConfig
    base_model_prefix = "roberta"

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.roberta = RobertaModel(config)
        self.classifier = RobertaClassificationHead(config)

    def forward( ... ):
        outputs = self.roberta( ... )
        sequence_output = outputs[0]
        logits = self.classifier(sequence_output)

class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
From the above code, producing the logits from the encoder output involves:
- BERT: dropout, linear (applied to the pooled [CLS] output)
- XLNet: SequenceSummary, then linear (and SequenceSummary itself can apply a projection, tanh, and dropout, depending on the config)
- RoBERTa: dropout, linear, tanh, dropout, linear (applied to the <s> token)
Why does each model implement the final layers differently? Are the implementations taken from the original papers? Wouldn’t it be better to make the classification layers identical across models, so that differences in classification performance reflect each model’s internal architecture rather than its classification head?
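For concreteness, here is the kind of shared head I have in mind. This is purely hypothetical: each model would first reduce its sequence output to a single pooled vector ([CLS], <s>, or the last token), and then this one module would produce the logits identically for all of them:

```python
import torch
import torch.nn as nn


class SharedClassificationHead(nn.Module):
    """Hypothetical uniform head: same final layers for every model.

    Takes an already-pooled vector (e.g. the [CLS]/<s>/last-token hidden
    state) and maps it to class logits, so any performance difference
    would come from the base model, not from the head.
    """

    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled):  # pooled: (batch, hidden_size)
        x = self.dropout(pooled)
        x = torch.tanh(self.dense(x))
        return self.out_proj(self.dropout(x))
```

This mirrors the RoBERTa head’s structure, but the specific choice of layers matters less to me than every model using the same one.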
Thank you for any help.