Why does run_glue.py change the Tiny BERT model?

j35t3r · February 10, 2024, 3:48pm

I am fine-tuning a tiny-bert model with run_glue:

export dataset="sst2"

python run_glue.py \
  --model_name_or_path prajjwal1/bert-tiny \
  --task_name ${dataset} \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 30 \
  --output_dir output/${dataset}

After training, I observed that the last part (Dropout and Classifier) of the Tiny BERT model is missing when I load the fine-tuned checkpoint.

from transformers import AutoModel, BertForSequenceClassification
model_name = "prajjwal1/bert-tiny"
model = BertForSequenceClassification.from_pretrained(model_name)
print("Original:", model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=128, out_features=512, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=512, out_features=128, bias=True)
            (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=128, out_features=128, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=128, out_features=2, bias=True)
)

model_name_path = "output/sst2/"  # checkpoint trained by run_glue.py
model = AutoModel.from_pretrained(model_name_path)
print("New:", model)
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 128, padding_idx=0)
    (position_embeddings): Embedding(512, 128)
    (token_type_embeddings): Embedding(2, 128)
    (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-1): 2 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=128, out_features=128, bias=True)
            (key): Linear(in_features=128, out_features=128, bias=True)
            (value): Linear(in_features=128, out_features=128, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=128, out_features=128, bias=True)
            (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=128, out_features=512, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=512, out_features=128, bias=True)
          (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=128, out_features=128, bias=True)
    (activation): Tanh()
  )
)

Why is the last part removed, and what does that mean? SST-2 is a sequence classification task, so the model needs a classifier, yet the head written by run_glue.py seems to disappear when I load the checkpoint with AutoModel.

On the model's page (prajjwal1/bert-tiny · "Some weights of BertForSequenceClassification were not initialized from the model"), it is also loaded with AutoModel. Shouldn't it be loaded as BertForSequenceClassification?
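One way to check whether the head is really gone or just not being loaded is to load the same checkpoint with both auto classes and compare. This is a minimal sketch, not the actual run_glue.py output: it assumes a randomly initialised BertForSequenceClassification with the bert-tiny dimensions printed above, saved to a temporary directory (`checkpoint_dir` is a hypothetical stand-in for output/sst2/), so it runs without downloading anything.

```python
import tempfile

from transformers import (
    AutoModel,
    AutoModelForSequenceClassification,
    BertConfig,
    BertForSequenceClassification,
)

# Assumption: stand-in for the fine-tuned checkpoint in output/sst2/.
# Randomly initialised weights, same layer sizes as the printout above.
config = BertConfig(
    vocab_size=30522,
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=512,
    num_labels=2,
)
checkpoint_dir = tempfile.mkdtemp()
BertForSequenceClassification(config).save_pretrained(checkpoint_dir)

# AutoModel resolves a BERT config to the bare BertModel, so the saved
# dropout/classifier weights are skipped at load time.
base = AutoModel.from_pretrained(checkpoint_dir)

# The sequence-classification auto class rebuilds the head and loads it.
clf = AutoModelForSequenceClassification.from_pretrained(checkpoint_dir)

print(type(base).__name__, hasattr(base, "classifier"))
print(type(clf).__name__, hasattr(clf, "classifier"))
```

If the same pattern holds for output/sst2/, the classifier is still stored in the checkpoint and only the choice of loading class decides whether it shows up in the printed model.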
