Generate logits from hidden state embeddings and decoder weights

Hi,

I am trying to compute the prediction logits with a BertForPreTraining model. For reasons specific to my use case, I don't want to use outputs.prediction_logits directly; instead, I want to reproduce them by multiplying the last hidden state by the decoder weights. The problem is that when I do this, the result is not equal to outputs.prediction_logits. Here is the code:

import numpy as np
import torch
from transformers import BertTokenizer, BertForPreTraining

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForPreTraining.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True).to(device)

# decoder weights and bias of the MLM head
w = model.state_dict()['cls.predictions.decoder.weight'].cpu().numpy()
b = model.state_dict()['cls.predictions.decoder.bias'].cpu().numpy()

# tokenize an example input
inputs = tokenizer("Some example sentence.", return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

output_logits = outputs.prediction_logits.cpu().numpy()
last_hidden_states = outputs.hidden_states[-1].cpu().numpy()

# pick one token: sequence i in the batch, position token_idx (example indices)
i, token_idx = 0, 1
preds = output_logits[i, token_idx]
h = last_hidden_states[i, token_idx]
h_transformed = np.dot(w, h) + b

Basically, I expect h_transformed to be equal to preds, but it is not.
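For reference, this is roughly how I'm comparing the two (an approximate check, to allow for floating-point rounding):

# Compare the manually computed logits with the model's logits for one token.
print(np.allclose(h_transformed, preds, atol=1e-4))  # False in my case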

Thanks for your help :)

I suspect dropout might be to blame here. You could create a small fully connected layer with dropout, initialize it with the decoder weights, and use it instead.
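Something along these lines, as a rough sketch (the dropout probability 0.1 is just an assumption, and w, b, h are the NumPy arrays from your snippet):

import torch
import torch.nn as nn

# Rebuild the decoder as a plain linear layer initialized from the
# pretrained weights, with a dropout layer in front of it.
vocab_size, hidden_size = w.shape
decoder = nn.Linear(hidden_size, vocab_size)
with torch.no_grad():
    decoder.weight.copy_(torch.from_numpy(w))
    decoder.bias.copy_(torch.from_numpy(b))
head = nn.Sequential(nn.Dropout(p=0.1), decoder)  # p=0.1 is an assumed value

with torch.no_grad():
    h_transformed = head(torch.from_numpy(h)).numpy()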

Thanks for your reply. I tried that, but the results still don't match.

The two outputs will not match exactly, because dropout affects the output randomly. I'm not insisting that dropout is the cause, but if it is, modeling it this way would reproduce the same training behaviour. You can check whether dropout is the culprit by calling model.eval() to disable it.
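A minimal check, reusing the model and inputs from your first snippet:

# Put the whole model into eval mode so dropout is disabled,
# then rerun the forward pass and the comparison.
model.eval()
with torch.no_grad():
    outputs = model(**inputs)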

I suspect that's not the problem, since dropout shouldn't be applied after the last layer's outputs have already been computed.