I noticed an inconsistency between the DistilBERT and BERT configs. The DistilBERT config stores the output hidden size as `dim` and the FFN dim as `hidden_dim`, while BERT and RoBERTa use `hidden_size` for the output and `intermediate_size` for the FFN dim.
I know this kind of thing can be hard to fix without breaking backward compatibility, but this behavior makes it a bit harder to get your model's output size upfront.
E.g. if I want to be able to use both DistilBERT and BERT as an encoder in my model like this:

```python
class MySuperCustomModel(nn.Module):
    def __init__(self, encoder, n_classes):
        super().__init__()
        self.encoder = encoder
        hidden_size = ...  # I wish it would be as simple as encoder.config.hidden_dim
        self.logit_network = nn.Linear(hidden_size, n_classes)
```
the code to get the encoder output size is kind of ugly, because you need to use `isinstance` or something like it.
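To illustrate, the workaround I currently use looks something like this (just a sketch; `get_encoder_output_size` is a hypothetical helper, and the stand-in config objects only mimic the attribute names mentioned above):

```python
from types import SimpleNamespace

def get_encoder_output_size(config):
    # DistilBERT-style configs expose `dim`; BERT/RoBERTa-style
    # configs expose `hidden_size`. Probe for whichever exists.
    for attr in ("hidden_size", "dim"):
        if hasattr(config, attr):
            return getattr(config, attr)
    raise AttributeError("config exposes no known hidden-size attribute")

# Stand-ins for the real config objects, just to show the dispatch:
bert_like = SimpleNamespace(hidden_size=768, intermediate_size=3072)
distilbert_like = SimpleNamespace(dim=768, hidden_dim=3072)
```

This avoids importing every config class for an `isinstance` chain, but it is still per-model-family special-casing that every downstream user has to rediscover.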
Of course, for classification you can use `*ModelForClassification`, but what if you want to use a pre-trained model as a seq2seq encoder, or to write some other custom model?
I feel like solving this issue would make quite a few people a bit happier, as they would be able to experiment with different pre-trained models without code modifications and without thinking about the differences between the transformer configs.
Is there a better way to get the output dimension of the model, or are any fixes planned? I can help with a PR, too.