I am trying to understand the heads and bodies of these models, so I am using the torchsummary library to inspect them.
The architectural details for each variant of the BERT model are shown below.
NOTE: I am trying to implement something new on top of a language-understanding model, so I am not using the ready-made heads such as question answering or mask filling.
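For reference, the parameter totals in the summaries below can be reproduced without downloading any weights, since BertConfig() defaults match bert-base-uncased. This is a minimal sketch assuming the transformers library is installed; it builds a randomly initialised model and counts parameters by hand:

```python
from transformers import BertConfig, BertModel

# BertConfig() defaults match bert-base-uncased
# (12 layers, hidden size 768, vocab size 30522),
# so a randomly initialised model has the same shapes.
model = BertModel(BertConfig())

total = sum(p.numel() for p in model.parameters())
print(total)  # 109482240, matching the AutoModel summary below
```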
I have the following questions:
1. Why do the models with task heads show only a single collapsed ModuleList: 3-6 entry for the encoder, while AutoModel lists BertLayer: 3-1 through BertLayer: 3-12 individually? Does ModuleList: 3-6 mean that only layers 3-1 to 3-6 are present?
2. How do I build a custom BERT architecture? Do I need to use AutoModel? If so, how do I drop some of the layers? For example, AutoModel shows BertLayer: 3-1 through 3-12, but AutoModelForQuestionAnswering does not show the individual ModuleList layers.
3. According to the Transformer architecture, BERT uses the encoder side. If I use only BertEmbeddings: 2-1 and BertEncoder: 2-2 (with the ModuleList: 3-6 layers), is that enough for the model to understand language? (I ask because the encoder part of the Transformer architecture is what does the language understanding.)
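To make questions 2 and 3 concrete, this is the kind of construction I have in mind. It is only a sketch assuming the transformers library; a randomly initialised model is used so nothing is downloaded. It builds a 6-layer model with no pooler (embeddings plus encoder only), and also shows slicing layers out of an already-built model:

```python
from transformers import BertConfig, BertModel

# Build a 6-layer BERT directly from the config.
config = BertConfig(num_hidden_layers=6)

# add_pooling_layer=False drops BertPooler, leaving only
# BertEmbeddings + BertEncoder.
model = BertModel(config, add_pooling_layer=False)

print(model.pooler)              # None
print(len(model.encoder.layer))  # 6

# Alternatively, layers can be dropped from an already-built model
# (e.g. one loaded with from_pretrained) by slicing the ModuleList:
full = BertModel(BertConfig())
full.encoder.layer = full.encoder.layer[:6]
full.config.num_hidden_layers = 6  # keep the config consistent
```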
For AutoModel
======================================================================
Layer (type:depth-idx) Param #
======================================================================
├─BertEmbeddings: 1-1 --
| └─Embedding: 2-1 23,440,896
| └─Embedding: 2-2 393,216
| └─Embedding: 2-3 1,536
| └─LayerNorm: 2-4 1,536
| └─Dropout: 2-5 --
├─BertEncoder: 1-2 --
| └─ModuleList: 2-6 --
| | └─BertLayer: 3-1 7,087,872
| | └─BertLayer: 3-2 7,087,872
| | └─BertLayer: 3-3 7,087,872
| | └─BertLayer: 3-4 7,087,872
| | └─BertLayer: 3-5 7,087,872
| | └─BertLayer: 3-6 7,087,872
| | └─BertLayer: 3-7 7,087,872
| | └─BertLayer: 3-8 7,087,872
| | └─BertLayer: 3-9 7,087,872
| | └─BertLayer: 3-10 7,087,872
| | └─BertLayer: 3-11 7,087,872
| | └─BertLayer: 3-12 7,087,872
├─BertPooler: 1-3 --
| └─Linear: 2-7 590,592
| └─Tanh: 2-8 --
======================================================================
Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
======================================================================
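As a sanity check on the per-layer figure above (7,087,872 parameters per BertLayer), a single layer can be instantiated on its own. This is a sketch assuming the transformers library; BertLayer is imported from the library's internal BERT module:

```python
from transformers import BertConfig
from transformers.models.bert.modeling_bert import BertLayer

# One encoder block with the bert-base-uncased default config.
layer = BertLayer(BertConfig())

per_layer = sum(p.numel() for p in layer.parameters())
print(per_layer)  # 7087872, matching each BertLayer row in the summary
```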
For AutoModelForQuestionAnswering
=================================================================
Layer (type:depth-idx) Param #
=================================================================
├─BertModel: 1-1 --
| └─BertEmbeddings: 2-1 --
| | └─Embedding: 3-1 23,440,896
| | └─Embedding: 3-2 393,216
| | └─Embedding: 3-3 1,536
| | └─LayerNorm: 3-4 1,536
| | └─Dropout: 3-5 --
| └─BertEncoder: 2-2 --
| | └─ModuleList: 3-6 85,054,464
├─Linear: 1-2 1,538
=================================================================
Total params: 108,893,186
Trainable params: 108,893,186
Non-trainable params: 0
=================================================================
For AutoModelForMaskedLM
===========================================================================
Layer (type:depth-idx) Param #
===========================================================================
├─BertModel: 1-1 --
| └─BertEmbeddings: 2-1 --
| | └─Embedding: 3-1 23,440,896
| | └─Embedding: 3-2 393,216
| | └─Embedding: 3-3 1,536
| | └─LayerNorm: 3-4 1,536
| | └─Dropout: 3-5 --
| └─BertEncoder: 2-2 --
| | └─ModuleList: 3-6 85,054,464
├─BertOnlyMLMHead: 1-2 --
| └─BertLMPredictionHead: 2-3 --
| | └─BertPredictionHeadTransform: 3-7 592,128
| | └─Linear: 3-8 23,471,418
===========================================================================
Total params: 132,955,194
Trainable params: 132,955,194
Non-trainable params: 0
===========================================================================
For AutoModelForTokenClassification
=================================================================
Layer (type:depth-idx) Param #
=================================================================
├─BertModel: 1-1 --
| └─BertEmbeddings: 2-1 --
| | └─Embedding: 3-1 23,440,896
| | └─Embedding: 3-2 393,216
| | └─Embedding: 3-3 1,536
| | └─LayerNorm: 3-4 1,536
| | └─Dropout: 3-5 --
| └─BertEncoder: 2-2 --
| | └─ModuleList: 3-6 85,054,464
├─Dropout: 1-2 --
├─Linear: 1-3 1,538
=================================================================
Total params: 108,893,186
Trainable params: 108,893,186
Non-trainable params: 0
=================================================================
For AutoModelForMultipleChoice
=================================================================
Layer (type:depth-idx) Param #
=================================================================
├─BertModel: 1-1 --
| └─BertEmbeddings: 2-1 --
| | └─Embedding: 3-1 23,440,896
| | └─Embedding: 3-2 393,216
| | └─Embedding: 3-3 1,536
| | └─LayerNorm: 3-4 1,536
| | └─Dropout: 3-5 --
| └─BertEncoder: 2-2 --
| | └─ModuleList: 3-6 85,054,464
| └─BertPooler: 2-3 --
| | └─Linear: 3-7 590,592
| | └─Tanh: 3-8 --
├─Dropout: 1-2 --
├─Linear: 1-3 769
=================================================================
Total params: 109,483,009
Trainable params: 109,483,009
Non-trainable params: 0
=================================================================
For AutoModelForCausalLM
===========================================================================
Layer (type:depth-idx) Param #
===========================================================================
├─BertModel: 1-1 --
| └─BertEmbeddings: 2-1 --
| | └─Embedding: 3-1 23,440,896
| | └─Embedding: 3-2 393,216
| | └─Embedding: 3-3 1,536
| | └─LayerNorm: 3-4 1,536
| | └─Dropout: 3-5 --
| └─BertEncoder: 2-2 --
| | └─ModuleList: 3-6 85,054,464
├─BertOnlyMLMHead: 1-2 --
| └─BertLMPredictionHead: 2-3 --
| | └─BertPredictionHeadTransform: 3-7 592,128
| | └─Linear: 3-8 23,471,418
===========================================================================
Total params: 132,955,194
Trainable params: 132,955,194
Non-trainable params: 0
===========================================================================
For AutoModelForSequenceClassification
=================================================================
Layer (type:depth-idx) Param #
=================================================================
├─BertModel: 1-1 --
| └─BertEmbeddings: 2-1 --
| | └─Embedding: 3-1 23,440,896
| | └─Embedding: 3-2 393,216
| | └─Embedding: 3-3 1,536
| | └─LayerNorm: 3-4 1,536
| | └─Dropout: 3-5 --
| └─BertEncoder: 2-2 --
| | └─ModuleList: 3-6 85,054,464
| └─BertPooler: 2-3 --
| | └─Linear: 3-7 590,592
| | └─Tanh: 3-8 --
├─Dropout: 1-2 --
├─Linear: 1-3 1,538
=================================================================
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
=================================================================