Hi!
I noticed an inconsistency between the DistilBERT and BERT configs. The DistilBERT config stores the output hidden size as `dim` and the FFN dimension as `hidden_dim`, while BERT and RoBERTa use `hidden_size` for the output size and `intermediate_size` for the FFN dimension.
I know this kind of thing can be hard to fix without breaking backward compatibility, but it makes it a bit harder to get your model's output size upfront.
E.g. if I want to be able to use both DistilBERT and BERT as an encoder in my model like this:

```python
import torch.nn as nn

class MySuperCustomModel(nn.Module):
    def __init__(self, encoder, n_classes):
        super().__init__()
        self.encoder = encoder
        hidden_size = ...  # I wish it were as simple as encoder.config.hidden_size
        self.logit_network = nn.Linear(hidden_size, n_classes)
```
the code to get the encoder output size ends up being ugly, because you need `isinstance` checks or something similar.
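For now, the lookup can be centralized in one small helper instead of scattering `isinstance` checks. This is just a sketch, assuming only these two naming conventions exist; the helper name is mine, not part of transformers:

```python
def get_hidden_size(config):
    """Return the encoder output size from a config object,
    trying the BERT-style attribute first, then the DistilBERT-style one."""
    for attr in ("hidden_size", "dim"):
        if hasattr(config, attr):
            return getattr(config, attr)
    raise AttributeError("config has no known hidden-size attribute")
```

Then `hidden_size = get_hidden_size(encoder.config)` works for both model families, but it still has to be updated for every new config naming scheme, which is exactly the problem.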
Of course, for classification you can use `*ForSequenceClassification`, but what if you want to use a pre-trained model as a seq2seq encoder or in some other custom model?
I feel like solving this issue would make quite a few people a bit happier: they could experiment with different pre-trained models without code modifications and without thinking about the differences between transformer configs.
Is there a better way to get the output dimension of the model, or are any fixes planned? I can help with a PR too.