Hi there!
I am fine-tuning a pretrained MobileBERT model. I loaded the checkpoint and tokenizer from "google/mobilebert-uncased" as suggested in the docs. Specifically, I use their MobileBertModel class.
I want to extract the embedding of [CLS] of an input sentence from the last layer for a downstream task. I assume that the first embedding of the output sequence corresponds to [CLS]. However, I found that this first embedding has extremely large values, around 1e6~1e7, while the following embeddings are quite normal, with values around 0. With this setting, I sometimes get nan when I fine-tune the model on the downstream task. Am I doing something silly here? Any suggestion is welcome. Thanks in advance :)
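For context, here is roughly how I take the [CLS] embedding and feed it to the downstream head (a minimal sketch of what I described above; the linear head and its output size are placeholders, not my actual task):

import torch
from transformers import AutoTokenizer, MobileBertModel

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
model = MobileBertModel.from_pretrained("google/mobilebert-uncased")
head = torch.nn.Linear(model.config.hidden_size, 2)  # placeholder classification head

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # first token should be [CLS]
logits = head(cls_embedding)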
How to reproduce (just follow the docs):
from transformers import AutoTokenizer, MobileBertModel
import torch

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
model = MobileBertModel.from_pretrained("google/mobilebert-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)  # forward pass
last_hidden_states = outputs.last_hidden_state
>>> last_hidden_states
tensor([[[-2.5655e+07,  9.8470e+04,  1.6557e+05,  ..., -1.6260e+06,
           1.2349e+06,  2.6711e+04],
         [ 1.9210e-01,  6.7220e-01, -8.1361e-01,  ...,  5.5035e-02,
           1.4415e+00,  4.5810e+00],
         [ 9.1281e-01,  1.9443e+00,  1.5657e+00,  ..., -1.2405e-01,
          -2.7288e+00,  2.7489e+00],
         ...,
         [ 1.5894e+00,  5.9102e-01,  1.9070e+00,  ...,  2.7961e+00,
          -2.6210e+00,  3.7704e+00],
         [ 1.5775e+00,  3.8555e+00, -6.2034e-01,  ...,  2.9500e+00,
          -2.2804e+00,  2.9576e+00],
         [ 9.0074e-01,  8.5517e-01,  1.1304e+00,  ...,  1.1470e+00,
          -1.2494e+00,  6.7297e-01]]], grad_fn=<AddBackward0>)
The tokenizer automatically prepends [CLS] to the input sequence.
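You can check this by printing the tokens (a quick sanity check; the commented output is what I expect, not from the docs):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# expected: ['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]']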