What are the differences between the following codes?
Code 1:
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-uncased")
Code 2:
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
Code 3:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
Considering the following image, please explain; it will help me.
When I searched for AutoModelForMaskedLM, it seemed that this way of loading a model gives it a head, but I am not sure. I have checked the layers, but they are all the same: all have 12 layers.
The AutoModel class will look at the bert-base-uncased model's configuration and choose the appropriate base model architecture to use, which in this case is BertModel. So Code 1 and Code 2 will essentially do the same thing, and when you run inference on either of those models you'll get the same output, which is the last hidden states from the bert-base-uncased model body.
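For instance, a quick sketch (assuming the bert-base-uncased checkpoint downloads fine) that shows the two resolve to the same class:
from transformers import AutoModel, BertModel

model1 = BertModel.from_pretrained("bert-base-uncased")  # Code 1
model2 = AutoModel.from_pretrained("bert-base-uncased")  # Code 2

# AutoModel resolves to BertModel for this checkpoint, so the classes match.
print(type(model1))                  # <class 'transformers.models.bert.modeling_bert.BertModel'>
print(type(model1) is type(model2))  # True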
However! bert-base-uncased was trained using a masked language modelling objective, so the model also has an associated classifier for turning those hidden states into logits that you can use for masked token prediction. In order to do that, you need to instantiate the model with a MaskedLM head, which you can do in a couple of ways, including:
- You can use BertForMaskedLM.
- You can use AutoModelForMaskedLM, like you did in Code 3 (which will go find the appropriate <MODEL>ForMaskedLM class).
So when you run inference on the resulting model, the outputs include the logits.
Here's a notebook illustrating it.
In short: with Code 1 and Code 2 you're instantiating the model without the head, and with Code 3 you're instantiating it with the head.
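For example, here is a minimal sketch of that difference (not the linked notebook), assuming you also load the matching tokenizer:
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")

base = AutoModel.from_pretrained("bert-base-uncased")            # no head
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # MLM head

with torch.no_grad():
    base_out = base(**inputs)
    mlm_out = mlm(**inputs)

# The model body returns hidden states; the head turns them into vocabulary logits.
print(base_out.last_hidden_state.shape)  # (1, seq_len, 768)
print(mlm_out.logits.shape)              # (1, seq_len, 30522)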
Hope this helps!
Thank you for your information.
I have checked using the config, but all the models in Codes 1, 2, and 3 give the same number of hidden layers: 12.
BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.23.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
Code example:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
print(model.config)
Why does it show the same model architecture? Is there any other method to check (without checking the output)?
I need to check the architecture.
I also need to know whether AutoModel and BertModel give Embeddings + Layers + Hidden States or just Embeddings + Layers.
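One way to check the architecture without running inference (a minimal sketch, assuming the same checkpoint as above) is to look at the class that was actually instantiated and print its module tree:
from transformers import AutoModel, AutoModelForMaskedLM

model_base = AutoModel.from_pretrained("bert-base-uncased")
model_mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# The printed config is the checkpoint's config, so it is identical in all
# three cases; the loaded classes and their submodules are what differ.
print(type(model_base).__name__)  # BertModel
print(type(model_mlm).__name__)   # BertForMaskedLM
print(model_mlm)                  # lists the submodules, including the MLM head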
@mineshj1291 I tried to use your code, but I need to pass an input size to the summary. How can I find it?
Code:
from transformers import AutoModel, AutoModelForMaskedLM
import torchsummary
model1 = AutoModel.from_pretrained("bert-base-uncased")
summary1 = torchsummary.summary(model1)
Error:
TypeError: summary() missing 1 required positional argument: 'input_size'
If I put the input size, it gives the following error:
from transformers import AutoModel, AutoModelForMaskedLM
import torchsummary
model1 = AutoModel.from_pretrained("bert-base-uncased")
summary1 = torchsummary.summary(model1, input_size=(512,))
Error:
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)
How did you get it to work?
You can try upgrading the torch-summary and transformers libraries.
Here is a Colab for the same, if you want to try it out: transformers_model_summary.ipynb
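For reference, a minimal sketch of the call that matches the summaries below, after upgrading (the dtypes hint is an assumption to check against your installed torch-summary/torchinfo version):
import torchsummary
from transformers import AutoModel

model1 = AutoModel.from_pretrained("bert-base-uncased")

# With a recent torch-summary, the input is optional: this prints only the
# layer names and parameter counts, as in the summaries below.
torchsummary.summary(model1)

# If you do want output shapes, the dummy input has to be integer token IDs
# rather than random floats (which is what the RuntimeError above complains
# about). Newer torch-summary/torchinfo versions accept a dtypes argument,
# e.g. summary(model1, input_size=(1, 512), dtypes=[torch.long]) -- but check
# the signature of your installed version first.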
Thank you very much. When I updated it, it worked.
When I look at the summaries, I can see the three architectures, including AutoModelForMaskedLM:
======================================================================
Layer (type:depth-idx)                             Param #
├─BertEmbeddings: 1-1                              --
|    └─Embedding: 2-1                              23,440,896
|    └─Embedding: 2-2                              393,216
|    └─Embedding: 2-3                              1,536
|    └─LayerNorm: 2-4                              1,536
|    └─Dropout: 2-5                                --
├─BertEncoder: 1-2                                 --
|    └─ModuleList: 2-6                             --
|    |    └─BertLayer: 3-1                         7,087,872
|    |    └─BertLayer: 3-2                         7,087,872
|    |    └─BertLayer: 3-3                         7,087,872
|    |    └─BertLayer: 3-4                         7,087,872
|    |    └─BertLayer: 3-5                         7,087,872
|    |    └─BertLayer: 3-6                         7,087,872
|    |    └─BertLayer: 3-7                         7,087,872
|    |    └─BertLayer: 3-8                         7,087,872
|    |    └─BertLayer: 3-9                         7,087,872
|    |    └─BertLayer: 3-10                        7,087,872
|    |    └─BertLayer: 3-11                        7,087,872
|    |    └─BertLayer: 3-12                        7,087,872
├─BertPooler: 1-3                                  --
|    └─Linear: 2-7                                 590,592
|    └─Tanh: 2-8                                   --
Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
======================================================================
Layer (type:depth-idx)                             Param #
├─BertModel: 1-1                                   --
|    └─BertEmbeddings: 2-1                         --
|    |    └─Embedding: 3-1                         23,440,896
|    |    └─Embedding: 3-2                         393,216
|    |    └─Embedding: 3-3                         1,536
|    |    └─LayerNorm: 3-4                         1,536
|    |    └─Dropout: 3-5                           --
|    └─BertEncoder: 2-2                            --
|    |    └─ModuleList: 3-6                        85,054,464
├─BertOnlyMLMHead: 1-2                             --
|    └─BertLMPredictionHead: 2-3                   --
|    |    └─BertPredictionHeadTransform: 3-7       592,128
|    |    └─Linear: 3-8                            23,471,418
Total params: 132,955,194
Trainable params: 132,955,194
Non-trainable params: 0
======================================================================
Layer (type:depth-idx)                             Param #
├─BertEmbeddings: 1-1                              --
|    └─Embedding: 2-1                              23,440,896
|    └─Embedding: 2-2                              393,216
|    └─Embedding: 2-3                              1,536
|    └─LayerNorm: 2-4                              1,536
|    └─Dropout: 2-5                                --
├─BertEncoder: 1-2                                 --
|    └─ModuleList: 2-6                             --
|    |    └─BertLayer: 3-1                         7,087,872
|    |    └─BertLayer: 3-2                         7,087,872
|    |    └─BertLayer: 3-3                         7,087,872
|    |    └─BertLayer: 3-4                         7,087,872
|    |    └─BertLayer: 3-5                         7,087,872
|    |    └─BertLayer: 3-6                         7,087,872
|    |    └─BertLayer: 3-7                         7,087,872
|    |    └─BertLayer: 3-8                         7,087,872
|    |    └─BertLayer: 3-9                         7,087,872
|    |    └─BertLayer: 3-10                        7,087,872
|    |    └─BertLayer: 3-11                        7,087,872
|    |    └─BertLayer: 3-12                        7,087,872
├─BertPooler: 1-3                                  --
|    └─Linear: 2-7                                 590,592
|    └─Tanh: 2-8                                   --
Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
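As a sanity check on those totals, a small worked computation using only the numbers printed above: the masked-LM variant drops the pooler but adds the prediction head.
base_total = 109_482_240         # BertModel total (first and third summaries)
pooler = 590_592                 # BertPooler Linear, absent from BertForMaskedLM
mlm_head = 592_128 + 23_471_418  # BertPredictionHeadTransform + decoder Linear

print(base_total - pooler + mlm_head)  # 132955194, matching the BertForMaskedLM total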