Hi there!
I am fine-tuning a pretrained MobileBERT model. I loaded the checkpoint and tokenizer from "google/mobilebert-uncased" as suggested in the docs. Specifically, I use their MobileBertModel class.
I want to extract the embedding of [CLS] of an input sentence from the last layer for a downstream task. I assume that the first embedding of the output sequence corresponds to [CLS]. However, I found that this first embedding has extremely large values, around 1e6~1e7, while the following embeddings are quite normal, with values around 0. With this setting, I sometimes get nan when I fine-tune the model on the downstream task. Am I doing something silly here? Any suggestion is welcome. Thanks in advance :)
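For context, here is roughly how I take the [CLS] embedding and feed it to the downstream head (a minimal sketch of what I described above; the linear head and its output size are placeholders, not my actual task):

import torch
from transformers import AutoTokenizer, MobileBertModel

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
model = MobileBertModel.from_pretrained("google/mobilebert-uncased")
head = torch.nn.Linear(model.config.hidden_size, 2)  # placeholder classification head

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # first token should be [CLS]
logits = head(cls_embedding)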
How to reproduce (just follow the docs):
from transformers import AutoTokenizer, MobileBertModel
import torch

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
model = MobileBertModel.from_pretrained("google/mobilebert-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)  # forward pass
last_hidden_states = outputs.last_hidden_state
>>> last_hidden_states
tensor([[[-2.5655e+07,  9.8470e+04,  1.6557e+05,  ..., -1.6260e+06,
           1.2349e+06,  2.6711e+04],
         [ 1.9210e-01,  6.7220e-01, -8.1361e-01,  ...,  5.5035e-02,
           1.4415e+00,  4.5810e+00],
         [ 9.1281e-01,  1.9443e+00,  1.5657e+00,  ..., -1.2405e-01,
          -2.7288e+00,  2.7489e+00],
         ...,
         [ 1.5894e+00,  5.9102e-01,  1.9070e+00,  ...,  2.7961e+00,
          -2.6210e+00,  3.7704e+00],
         [ 1.5775e+00,  3.8555e+00, -6.2034e-01,  ...,  2.9500e+00,
          -2.2804e+00,  2.9576e+00],
         [ 9.0074e-01,  8.5517e-01,  1.1304e+00,  ...,  1.1470e+00,
          -1.2494e+00,  6.7297e-01]]], grad_fn=<AddBackward0>)
The tokenizer automatically prepends [CLS] to the input sequence.
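You can check this by printing the tokens (a quick sanity check; the commented output is what I expect, not from the docs):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# expected: ['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]']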