Naming inconsistency in Distilbert config

I noticed an inconsistency between the DistilBERT and BERT configs. The DistilBERT config stores the output hidden size as `dim` and the FFN dim as `hidden_dim`, while BERT and RoBERTa use `hidden_size` for the output and `intermediate_size` for the FFN dim.

I know such a thing can be hard to fix without breaking backward compatibility, but this behavior makes it a bit harder to get your model's output size upfront.
E.g. if I want to be able to use both DistilBERT and BERT as an encoder in my model like this:

import torch.nn as nn

class MySuperCustomModel(nn.Module):
    def __init__(self, encoder, n_classes):
        super().__init__()
        self.encoder = encoder
        hidden_size = ...  # I wish it would be as simple as encoder.config.hidden_size
        self.logit_network = nn.Linear(hidden_size, n_classes)

the code to get the encoder output size is kind of ugly, because you need `isinstance` checks or something similar.
Of course, for classification you can use *ModelForClassification, but what if you want to use a pre-trained model as a seq2seq encoder, or to write some other custom model?
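To make the workaround concrete, here is a rough sketch of the kind of helper you end up writing today. The `get_hidden_size` function is hypothetical (not part of transformers), and the `SimpleNamespace` objects below just stand in for the real config classes:

```python
from types import SimpleNamespace


def get_hidden_size(config):
    """Return the encoder output width across config flavors.

    BERT/RoBERTa configs expose `hidden_size`; DistilBERT exposes `dim`.
    Hypothetical helper, not part of the transformers library.
    """
    for attr in ("hidden_size", "dim"):
        if hasattr(config, attr):
            return getattr(config, attr)
    raise AttributeError("config has no known hidden-size attribute")


# Stand-ins for the real config objects, just for illustration:
bert_like_cfg = SimpleNamespace(hidden_size=768)
distil_like_cfg = SimpleNamespace(dim=768)

assert get_hidden_size(bert_like_cfg) == 768
assert get_hidden_size(distil_like_cfg) == 768
```

It works, but every custom model that wraps an encoder has to carry a copy of this attribute-probing logic around.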

I feel like solving this issue would make quite a few people a bit happier: they would be able to experiment with different pre-trained models without code modifications and without thinking about the differences between transformer configs.

Is there a better way to get the output dimension of the model or any fixes planned? I can help with a PR too.

You can make a PR with new properties for those configs (like hidden_size for DistilBert), but we can't change the names of the config arguments, as that would be a severe breaking change.
I agree that consistently named properties would be useful!
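A minimal sketch of what such alias properties could look like. The class below is a simplified stand-in, not the actual `DistilBertConfig` from transformers (which has many more arguments); the point is just that read-only properties can expose BERT's names without touching the existing `__init__` arguments:

```python
class DistilBertConfig:
    """Simplified stand-in for the real config, for illustration only."""

    def __init__(self, dim=768, hidden_dim=3072):
        self.dim = dim                # encoder output size (existing name)
        self.hidden_dim = hidden_dim  # FFN inner size (existing name)

    @property
    def hidden_size(self):
        # Read-only alias matching BertConfig's naming.
        return self.dim

    @property
    def intermediate_size(self):
        # Read-only alias matching BertConfig's naming.
        return self.hidden_dim


cfg = DistilBertConfig()
assert cfg.hidden_size == 768
assert cfg.intermediate_size == 3072
```

Because the properties only read from the existing attributes, serialization and all existing code paths that use `dim`/`hidden_dim` keep working unchanged.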