With a fresh install of transformers and pytorch, I ran the example code from the MobileBERT — transformers 4.11.3 documentation:
>>> from transformers import MobileBertTokenizer, MobileBertForNextSentencePrediction
>>> import torch
>>> tokenizer = MobileBertTokenizer.from_pretrained('google/mobilebert-uncased')
>>> model = MobileBertForNextSentencePrediction.from_pretrained('google/mobilebert-uncased')
>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> next_sentence = "The sky is blue due to the shorter wavelength of blue light."
>>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')
>>> outputs = model(**encoding, labels=torch.LongTensor([1]))
>>> loss = outputs.loss
>>> logits = outputs.logits
Printing the logits, we get tensor([[2.7888e+08, 2.7884e+08]], grad_fn=<AddmmBackward>). For comparison, the logits produced on the same example using BertForNextSentencePrediction with bert-base-uncased instead are tensor([[-3.0729, 5.9056]], grad_fn=<AddmmBackward>).
I tried lots of different examples and got the same strange behavior every time: logits of about 2e+08 for both classes, with the first class larger only in the 3rd or 4th significant figure. Because the logits are that large, even this tiny relative difference is an enormous absolute difference, so the softmax assigns a score of 1 to "is the next sentence" (the first class) and 0 to the other, no matter what the first and second sentences are, and no matter how unrelated the second sentence is.
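To illustrate why the probabilities collapse, here is a minimal sketch (using the logit values reported above) showing that a ~4e+04 absolute gap between the two logits saturates the softmax completely in float32:

```python
import torch

# Logits of the magnitude reported above: ~2.8e+08 for both classes,
# with the first class larger only in the 3rd-4th significant figure.
logits = torch.tensor([[2.7888e+08, 2.7884e+08]])

# Softmax depends on the absolute difference between logits. Here the
# gap is ~4e+04, so exp(-gap) underflows to 0 and the distribution
# saturates to exactly [1, 0] regardless of the input sentences.
probs = torch.softmax(logits, dim=-1)
print(probs)  # tensor([[1., 0.]])
```

This is just a demonstration of the numerical saturation, not an explanation of why the model emits such large logits in the first place.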
Is there something missing from the example code in the documentation that needs to be done in order to get non-degenerate outputs for the Next Sentence Prediction task the model was pretrained on?
cc @vshampor