Next sentence prediction with google/mobilebert-uncased producing massive, near-identical logits > 10^8 for its documentation example (and >2k others tried)

With a fresh install of transformers and PyTorch, I ran the example code from the MobileBERT page of the transformers 4.11.3 documentation:

>>> from transformers import MobileBertTokenizer, MobileBertForNextSentencePrediction
>>> import torch

>>> tokenizer = MobileBertTokenizer.from_pretrained('google/mobilebert-uncased')
>>> model = MobileBertForNextSentencePrediction.from_pretrained('google/mobilebert-uncased')

>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> next_sentence = "The sky is blue due to the shorter wavelength of blue light."
>>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')

>>> outputs = model(**encoding, labels=torch.LongTensor([1]))
>>> loss = outputs.loss
>>> logits = outputs.logits

Printing the logits, we get tensor([[2.7888e+08, 2.7884e+08]], grad_fn=<AddmmBackward>)
For comparison, the logits produced on the same example using BertForNextSentencePrediction with bert-base-uncased instead are tensor([[-3.0729, 5.9056]], grad_fn=<AddmmBackward>).

I tried lots of different examples and got the same strange behavior: logits of about 2e+08 for both classes, with the first class higher only in the 3rd or 4th significant figure. At that scale, the gap still yields a softmax score of 1 for "is the next sentence" (the first class) and 0 for the other, no matter what the two sentences are or how unrelated the second is.
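To illustrate why the scores saturate: here is a minimal pure-Python softmax over the two logit values from the output above. The absolute magnitudes cancel, but a difference in the 4th significant figure of a 2.8e+08 number is still a gap of about 4e+04 in the exponent, which drives the softmax to exactly [1, 0] in float arithmetic.

```python
import math

# Logit values copied from the MobileBERT output above.
logits = [2.7888e8, 2.7884e8]

# Numerically stable softmax: subtract the max before exponentiating.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
probs = [e / sum(exps) for e in exps]

print(probs)  # the gap of ~4e4 underflows the second class to 0.0
```

So even though both logits are huge and nearly identical, any pair of sentences gets probability 1 for the first class, which matches the degenerate behavior described.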

Is there something missing from the documentation's example code that needs to be done in order to get non-degenerate outputs for the next sentence prediction task the model was pretrained on?

cc @vshampor

Linked issue: Logit explosion in MobileBertForNextSentencePrediction example from documentation (and all others tried) · Issue #13990 · huggingface/transformers · GitHub