Difference between BertModel, AutoModel, and AutoModelForMaskedLM

What is the difference between the following code snippets?

Code 1:

from transformers import BertModel
model = BertModel.from_pretrained("bert-base-uncased")

Code 2:

from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")

Code 3:

from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

Considering the following image, please explain; it would really help me.

From what I have searched, the AutoModelForMaskedLM method loads models with a head, but I'm not sure. I have checked the layers, but they are all the same: all of them have 12 layers.

The AutoModel will look at the bert-base-uncased model’s configuration and choose the appropriate base model architecture to use, which in this case is BertModel. So Code 1 and Code 2 will essentially do the same thing, and when you run an inference on either of those models you’ll get the same output, which is the last hidden states from the bert-base-uncased model body.
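For example, here is a minimal sketch (using only the classes mentioned above) showing that AutoModel resolves this checkpoint to BertModel:

from transformers import AutoModel, BertModel

model_1 = BertModel.from_pretrained("bert-base-uncased")
model_2 = AutoModel.from_pretrained("bert-base-uncased")

# AutoModel picked BertModel based on the checkpoint's configuration,
# so both objects are instances of the same class.
print(type(model_1).__name__)  # BertModel
print(type(model_2).__name__)  # BertModel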

However! bert-base-uncased was trained using a masked language modelling objective, so the model also has an associated classifier for turning those hidden states into logits that you can use for masked token prediction. In order to do that, you need to instantiate the model with a MaskedLM head, which you can do in a couple ways, including:

  1. You can use BertForMaskedLM
  2. You can use AutoModelForMaskedLM, like you did in Code 3 (which will go find the appropriate <MODEL>ForMaskedLM class).

So when you run inference on the resulting model, the outputs will include the logits.
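A minimal sketch of what that looks like (the example sentence is my own; the logits are over the vocabulary, so you can pick the most likely token for the [MASK] position):

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Find the [MASK] position and take the highest-scoring vocabulary entry there.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = int(logits[0, mask_index].argmax())
print(tokenizer.decode([predicted_id]))  # the most likely fill for [MASK]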

Here’s a notebook illustrating it.

In short: with Code 1 and Code 2 you're instantiating the model without the head, and with Code 3 you're instantiating it with the head.
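If you want to see that difference without running anything through the models, here is a sketch using plain PyTorch module inspection (the expected printouts are in the comments):

from transformers import AutoModel, AutoModelForMaskedLM

without_head = AutoModel.from_pretrained("bert-base-uncased")
with_head = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# The bare model exposes only the body's submodules...
print([name for name, _ in without_head.named_children()])  # ['embeddings', 'encoder', 'pooler']
# ...while the MaskedLM variant wraps the body ('bert') and adds the head ('cls').
print([name for name, _ in with_head.named_children()])     # ['bert', 'cls']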

Hope this helps!


Thank you for the information.

I have checked using the config, but the models in Codes 1, 2, and 3 all report the same number of hidden layers: 12.

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.23.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Code example:

from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
print(model.config)

Why does it show the same model architecture? Is there any other method to check (without checking the output)?

I need to check the architecture.

I also need to know whether AutoModel and BertModel give embeddings + layers + hidden states, or only embeddings + layers.

You can try this way:
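A minimal sketch of this approach, assuming the torch-summary package (mentioned later in this thread), which can print the module tree without an input size:

import torchsummary  # provided by the torch-summary package
from transformers import AutoModel, AutoModelForMaskedLM

base_model = AutoModel.from_pretrained("bert-base-uncased")
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# With torch-summary, calling summary() without input data just lists the
# submodules and their parameter counts, which is enough to spot the MLM head.
torchsummary.summary(base_model)
torchsummary.summary(mlm_model)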

@mineshj1291

I tried it with your code, but I need to pass an input size to the summary. How can I find it?

Code:

from transformers import AutoModel, AutoModelForMaskedLM
import torchsummary
model1 = AutoModel.from_pretrained("bert-base-uncased")
summary1 = torchsummary.summary(model1)

Error:

TypeError: summary() missing 1 required positional argument: 'input_size'

If I pass an input size, it gives the following error:

from transformers import AutoModel, AutoModelForMaskedLM
import torchsummary
model1 = AutoModel.from_pretrained("bert-base-uncased")
summary1 = torchsummary.summary(model1, input_size=(512,))

Error:

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

How did you get it to work?

You can try upgrading the torch-summary and transformers libraries.
Here is a Colab notebook for the same, if you want to try it out: transformers_model_summary.ipynb
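For example, in a Colab cell (a sketch; pin exact versions if you need reproducibility):

!pip install --upgrade torch-summary transformers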

Thank you very much. When I updated it, it worked.

When I print the summaries, I can see the three architectures, including the one from AutoModelForMaskedLM:

======================================================================
Layer (type:depth-idx)                             Param #
======================================================================
├─BertEmbeddings: 1-1                              --
|    └─Embedding: 2-1                              23,440,896
|    └─Embedding: 2-2                              393,216
|    └─Embedding: 2-3                              1,536
|    └─LayerNorm: 2-4                              1,536
|    └─Dropout: 2-5                                --
├─BertEncoder: 1-2                                 --
|    └─ModuleList: 2-6                             --
|    |    └─BertLayer: 3-1                         7,087,872
|    |    └─BertLayer: 3-2                         7,087,872
|    |    └─BertLayer: 3-3                         7,087,872
|    |    └─BertLayer: 3-4                         7,087,872
|    |    └─BertLayer: 3-5                         7,087,872
|    |    └─BertLayer: 3-6                         7,087,872
|    |    └─BertLayer: 3-7                         7,087,872
|    |    └─BertLayer: 3-8                         7,087,872
|    |    └─BertLayer: 3-9                         7,087,872
|    |    └─BertLayer: 3-10                        7,087,872
|    |    └─BertLayer: 3-11                        7,087,872
|    |    └─BertLayer: 3-12                        7,087,872
├─BertPooler: 1-3                                  --
|    └─Linear: 2-7                                 590,592
|    └─Tanh: 2-8                                   --
======================================================================
Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
======================================================================

======================================================================
Layer (type:depth-idx)                             Param #
======================================================================
├─BertModel: 1-1                                   --
|    └─BertEmbeddings: 2-1                         --
|    |    └─Embedding: 3-1                         23,440,896
|    |    └─Embedding: 3-2                         393,216
|    |    └─Embedding: 3-3                         1,536
|    |    └─LayerNorm: 3-4                         1,536
|    |    └─Dropout: 3-5                           --
|    └─BertEncoder: 2-2                            --
|    |    └─ModuleList: 3-6                        85,054,464
├─BertOnlyMLMHead: 1-2                             --
|    └─BertLMPredictionHead: 2-3                   --
|    |    └─BertPredictionHeadTransform: 3-7       592,128
|    |    └─Linear: 3-8                            23,471,418
======================================================================
Total params: 132,955,194
Trainable params: 132,955,194
Non-trainable params: 0
======================================================================

======================================================================
Layer (type:depth-idx)                             Param #
======================================================================
├─BertEmbeddings: 1-1                              --
|    └─Embedding: 2-1                              23,440,896
|    └─Embedding: 2-2                              393,216
|    └─Embedding: 2-3                              1,536
|    └─LayerNorm: 2-4                              1,536
|    └─Dropout: 2-5                                --
├─BertEncoder: 1-2                                 --
|    └─ModuleList: 2-6                             --
|    |    └─BertLayer: 3-1                         7,087,872
|    |    └─BertLayer: 3-2                         7,087,872
|    |    └─BertLayer: 3-3                         7,087,872
|    |    └─BertLayer: 3-4                         7,087,872
|    |    └─BertLayer: 3-5                         7,087,872
|    |    └─BertLayer: 3-6                         7,087,872
|    |    └─BertLayer: 3-7                         7,087,872
|    |    └─BertLayer: 3-8                         7,087,872
|    |    └─BertLayer: 3-9                         7,087,872
|    |    └─BertLayer: 3-10                        7,087,872
|    |    └─BertLayer: 3-11                        7,087,872
|    |    └─BertLayer: 3-12                        7,087,872
├─BertPooler: 1-3                                  --
|    └─Linear: 2-7                                 590,592
|    └─Tanh: 2-8                                   --
======================================================================
Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
======================================================================
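As a sanity check on these numbers, here is a small arithmetic sketch (all figures come from the summaries above):

# The MaskedLM summary drops the 590,592-parameter BertPooler but adds the MLM head
# (prediction-head transform + decoder Linear), which accounts for the larger total.
base_total = 109_482_240               # first and third summaries
pooler     = 590_592                   # BertPooler Linear
mlm_head   = 592_128 + 23_471_418      # BertPredictionHeadTransform + decoder Linear
print(base_total - pooler + mlm_head)  # 132,955,194, matching the second summary

# Note: the decoder Linear's weight is tied to the input word embeddings in BERT,
# so torch-summary counts that shared matrix twice in the total.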