I have a quick question regarding SequenceClassification with the RobertaClassificationHead: the implementation of the dense layer on top of the transformer has config.hidden_size x config.hidden_size
connections. From a theoretical point of view, would it make sense to let the user choose the dimension of the dense/projection layer? And if it does make sense, what would be the best way to do this right now?
Something like this would probably do what I expect:
self.dense = nn.Linear(config.hidden_size, config.proj_dim)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.out_proj = nn.Linear(config.proj_dim, config.num_labels)
I arrived at this question while experimenting with a RoBERTa model whose parameters are frozen, training only the classification head. In my case, training more than half a million parameters in the classification head seems like overkill for my small dataset.
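For what it's worth, this is roughly the workaround I'm using at the moment: a copy of the head with a configurable width, swapped in after loading. It's only a sketch, and proj_dim=128 and the class name are my own choices, not anything provided by the library.

import torch
import torch.nn as nn
from transformers import RobertaForSequenceClassification

class SmallClassificationHead(nn.Module):
    """Same structure as RobertaClassificationHead, but with a configurable width."""

    def __init__(self, config, proj_dim):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, proj_dim)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(proj_dim, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Freeze the transformer and train only the (now much smaller) head.
for param in model.roberta.parameters():
    param.requires_grad = False

model.classifier = SmallClassificationHead(model.config, proj_dim=128)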
Original RobertaClassificationHead code for reference:
import torch
import torch.nn as nn

class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
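To put a number on "more than half a million parameters": assuming roberta-base (hidden_size = 768) and a 2-label task, the head alone has (768*768 + 768) + (768*2 + 2) = 592,130 trainable parameters, e.g.:

from transformers import RobertaConfig, RobertaForSequenceClassification

config = RobertaConfig.from_pretrained("roberta-base", num_labels=2)
model = RobertaForSequenceClassification(config)

# Count only the classification head's parameters.
head_params = sum(p.numel() for p in model.classifier.parameters())
print(head_params)  # 592130 = (768*768 + 768) + (768*2 + 2)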