Difference between BertModel, AutoModel and AutoModelForMaskedLM

What is the difference between the following code snippets?

Code 1:

from transformers import BertModel
model = BertModel.from_pretrained("bert-base-uncased")

Code 2:

from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")

Code 3:

from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

Considering the following image, please explain. It would really help me.

From what I found, models loaded with AutoModelForMaskedLM seem to have a head, but I am not sure. I have checked the layers, but they all look the same: each model has 12 layers.

The AutoModel will look at the bert-base-uncased model’s configuration and choose the appropriate base model architecture to use, which in this case is BertModel. So Code 1 and Code 2 will essentially do the same thing, and when you run an inference on either of those models you’ll get the same output, which is the last hidden states from the bert-base-uncased model body.
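
For instance, you can check which concrete class AutoModel resolves to (a minimal sketch; the exact printed class path may differ slightly across transformers versions):

from transformers import AutoModel, BertModel

model = AutoModel.from_pretrained("bert-base-uncased")
print(type(model))                   # <class 'transformers.models.bert.modeling_bert.BertModel'>
print(isinstance(model, BertModel))  # True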

However! bert-base-uncased was trained using a masked language modelling objective, so the model also has an associated classifier for turning those hidden states into logits that you can use for masked token prediction. In order to do that, you need to instantiate the model with a MaskedLM head, which you can do in a couple ways, including:

  1. You can use BertForMaskedLM
  2. You can use AutoModelForMaskedLM, like you did in Code 3 (which will go find the appropriate <MODEL>ForMaskedLM class).

So then when you run an inference on the resulting model, the outputs have the logits.
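
Here is a minimal sketch of that difference (it assumes the matching AutoTokenizer; the sequence length in the printed shapes depends on the example sentence):

from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")

# Body only: the output exposes last_hidden_state with shape (batch, seq_len, 768)
body = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    print(body(**inputs).last_hidden_state.shape)

# Body + MaskedLM head: the output exposes logits over the vocabulary, shape (batch, seq_len, 30522)
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
with torch.no_grad():
    print(mlm(**inputs).logits.shape)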

Here’s a notebook illustrating it.

In short: with Code 1 and Code 2 you're instantiating the model without the head, and with Code 3 you're instantiating it with the head.

Hope this helps!


Thank you for the information.

I have checked using the config, but the models from Codes 1, 2 and 3 all report the same number of hidden layers (12).

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.23.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Code example:

from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
print(model.config)

Why does it show the same model architecture for all of them? Is there any other method to check (without checking the output)?

I need to check the architecture.

I also need to know whether AutoModel and BertModel give Embeddings + Layers + Hidden States, or just Embeddings + Layers?

You can try it this way:
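
Presumably something along these lines (a sketch using the torch-summary package, which can print per-module parameter counts without needing an input size):

from transformers import AutoModel, AutoModelForMaskedLM
import torchsummary

model1 = AutoModel.from_pretrained("bert-base-uncased")
model2 = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Called without an input size, torch-summary just lists each module and its parameter count
torchsummary.summary(model1)
torchsummary.summary(model2)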

@mineshj1291

I tried it with your code, but I need to give an input size to the summary. How can I find it?

Code:

from transformers import AutoModel, AutoModelForMaskedLM
import torchsummary
model1 = AutoModel.from_pretrained("bert-base-uncased")
summary1 = torchsummary.summary(model1)

Error:

TypeError: summary() missing 1 required positional argument: 'input_size'

If I provide an input size, it gives the following error:

from transformers import AutoModel, AutoModelForMaskedLM
import torchsummary
model1 = AutoModel.from_pretrained("bert-base-uncased")
summary1 = torchsummary.summary(model1, input_size=(512,))

Error:

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

How did you get it to work?

You can try upgrading the torch-summary and transformers libraries.
Here is a Colab for the same, if you want to try it out: transformers_model_summary.ipynb
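
For reference, the older torchsummary package requires an input_size and feeds a random FloatTensor into the model, which fails on BERT's embedding layer because it expects integer token ids. With the newer torch-summary you can skip the input entirely, or ask for an integer dtype (a sketch; the exact keyword arguments may vary between versions):

import torch
from transformers import AutoModel
import torchsummary

model = AutoModel.from_pretrained("bert-base-uncased")

# No input at all: torch-summary only lists modules and parameter counts
torchsummary.summary(model)

# With an input size, request integer token ids instead of the default float tensor
torchsummary.summary(model, (1, 512), dtypes=[torch.long])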

Thank you very much. When I updated it, it works.

When I print the summaries, I can see the 3 architectures, including AutoModelForMaskedLM.

======================================================================
Layer (type:depth-idx) Param #

├─BertEmbeddings: 1-1 –
| └─Embedding: 2-1 23,440,896
| └─Embedding: 2-2 393,216
| └─Embedding: 2-3 1,536
| └─LayerNorm: 2-4 1,536
| └─Dropout: 2-5 –
├─BertEncoder: 1-2 –
| └─ModuleList: 2-6 –
| | └─BertLayer: 3-1 7,087,872
| | └─BertLayer: 3-2 7,087,872
| | └─BertLayer: 3-3 7,087,872
| | └─BertLayer: 3-4 7,087,872
| | └─BertLayer: 3-5 7,087,872
| | └─BertLayer: 3-6 7,087,872
| | └─BertLayer: 3-7 7,087,872
| | └─BertLayer: 3-8 7,087,872
| | └─BertLayer: 3-9 7,087,872
| | └─BertLayer: 3-10 7,087,872
| | └─BertLayer: 3-11 7,087,872
| | └─BertLayer: 3-12 7,087,872
├─BertPooler: 1-3 –
| └─Linear: 2-7 590,592
| └─Tanh: 2-8 –

Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0

===========================================================================
Layer (type:depth-idx) Param #

├─BertModel: 1-1 –
| └─BertEmbeddings: 2-1 –
| | └─Embedding: 3-1 23,440,896
| | └─Embedding: 3-2 393,216
| | └─Embedding: 3-3 1,536
| | └─LayerNorm: 3-4 1,536
| | └─Dropout: 3-5 –
| └─BertEncoder: 2-2 –
| | └─ModuleList: 3-6 85,054,464
├─BertOnlyMLMHead: 1-2 –
| └─BertLMPredictionHead: 2-3 –
| | └─BertPredictionHeadTransform: 3-7 592,128
| | └─Linear: 3-8 23,471,418

Total params: 132,955,194
Trainable params: 132,955,194
Non-trainable params: 0

======================================================================
Layer (type:depth-idx) Param #

├─BertEmbeddings: 1-1 –
| └─Embedding: 2-1 23,440,896
| └─Embedding: 2-2 393,216
| └─Embedding: 2-3 1,536
| └─LayerNorm: 2-4 1,536
| └─Dropout: 2-5 –
├─BertEncoder: 1-2 –
| └─ModuleList: 2-6 –
| | └─BertLayer: 3-1 7,087,872
| | └─BertLayer: 3-2 7,087,872
| | └─BertLayer: 3-3 7,087,872
| | └─BertLayer: 3-4 7,087,872
| | └─BertLayer: 3-5 7,087,872
| | └─BertLayer: 3-6 7,087,872
| | └─BertLayer: 3-7 7,087,872
| | └─BertLayer: 3-8 7,087,872
| | └─BertLayer: 3-9 7,087,872
| | └─BertLayer: 3-10 7,087,872
| | └─BertLayer: 3-11 7,087,872
| | └─BertLayer: 3-12 7,087,872
├─BertPooler: 1-3 –
| └─Linear: 2-7 590,592
| └─Tanh: 2-8 –

Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
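
The summaries match the explanation above: two of them (the head-less AutoModel / BertModel) are identical and end in a BertPooler, while the AutoModelForMaskedLM one drops the pooler and adds the BertOnlyMLMHead. Re-adding the numbers from the tables shows where the extra ~23.5M parameters come from:

# Figures copied from the summaries above
embeddings = 23_440_896 + 393_216 + 1_536 + 1_536  # word, position, token-type embeddings + LayerNorm
encoder    = 12 * 7_087_872                        # 12 BertLayer blocks
pooler     = 590_592                               # Linear(768, 768) (+ Tanh, no params)
mlm_head   = 592_128 + 23_471_418                  # prediction-head transform + vocab-size decoder

print(embeddings + encoder + pooler)    # 109482240 -> AutoModel / BertModel
print(embeddings + encoder + mlm_head)  # 132955194 -> AutoModelForMaskedLM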