BertForMaskedLM’s loss and scores: how is the loss computed?

I have a simple MaskedLM model with one masked token at position 7. The model returns 20.2516 and 18.0698 as the loss and score respectively. However, I’m not sure how the loss is computed from the score. I assumed the loss should be
loss = -log(softmax(score)[prediction])
but computing this loss gives 0.0002. I’m confused about how the loss is computed in the model.

import copy
from transformers import BertForMaskedLM, BertTokenizerFast
import torch
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')


text = "Who was Jim Paterson ? Jim Paterson is a doctor".lower()
inputs = tokenizer.encode_plus(text, return_tensors="pt", add_special_tokens=True, truncation=True,
                               pad_to_max_length=True, return_attention_mask=True, max_length=64)
input_ids = inputs['input_ids']
masked = copy.deepcopy(inputs['input_ids'])
masked[0][7] = 103  # 103 is the [MASK] token id for bert-base-uncased
for t in range(len(masked[0])):
  if masked[0][t] != 103:
    masked[0][t] = -100
loss, scores = model(input_ids=input_ids, attention_mask=inputs['attention_mask'], token_type_ids=inputs['token_type_ids'], labels=masked)
print('loss',loss)
print(scores.shape)
pred = torch.argmax(scores[0][7]).item()
print("predicted token:", pred, tokenizer.convert_ids_to_tokens([pred]))
print("score:", scores[0][7][pred])


logSoftmax = torch.nn.LogSoftmax(dim=1)
NLLLos = torch.nn.NLLLoss()
output = NLLLos(logSoftmax(torch.unsqueeze(scores[0][7], 0)), torch.tensor([pred]))
print(output)

Hi @sanaz,

I can see a few mistakes here:

  1. You need to mask tokens in the input_ids, not in the labels. To prepare labels for masked LM, set every position to -100 (the ignore index) except the masked positions.
  2. The masked LM loss is then calculated simply as the CrossEntropy loss between the logits and the labels; the -100 positions are ignored (see the small sketch right after this list).
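
To make the ignore-index point concrete, here is a minimal toy sketch (my own, not from the model code) showing that positions labelled -100 contribute nothing to the CrossEntropy average, so the loss comes only from the masked position:

import torch
import torch.nn.functional as F

# toy logits for a batch of 1, a sequence of 4 tokens, and a vocabulary of 6
vocab_size = 6
logits = torch.randn(1, 4, vocab_size)
# only position 2 carries a real label (token id 3); everything else is -100
labels = torch.tensor([[-100, -100, 3, -100]])

loss_all = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))  # -100 is the default ignore_index
loss_pos = F.cross_entropy(logits[0, 2].unsqueeze(0), torch.tensor([3]))  # loss at the masked position only
print(torch.allclose(loss_all, loss_pos))  # True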

So the correct usage would be:

text = "Who was Jim Paterson ? Jim Paterson is a doctor".lower()
inputs = tokenizer([text], return_tensors="pt")

input_ids = inputs["input_ids"]
# mask the token
input_ids[0][7] = tokenizer.mask_token_id

labels = inputs["input_ids"].clone()
labels[labels != tokenizer.mask_token_id] = -100 # only calculate loss on masked tokens

loss, logits = model(
    input_ids=input_ids,
    labels=labels,
    attention_mask=inputs["attention_mask"],
    token_type_ids=inputs["token_type_ids"]
)
# loss => 18.2054

# calculate loss manually
import torch.nn.functional as F
loss2 = F.cross_entropy(logits.view(-1, tokenizer.vocab_size), labels.view(-1))
# loss2 => 18.2054
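
Note that the tuple unpacking loss, logits = model(...) above assumes an older transformers version; in more recent (4.x) releases the forward call returns a ModelOutput object by default, so the same thing would look roughly like this (version-dependent sketch):

# transformers 4.x returns a ModelOutput; loss and logits are attributes, not a tuple
outputs = model(
    input_ids=input_ids,
    labels=labels,
    attention_mask=inputs["attention_mask"],
    token_type_ids=inputs["token_type_ids"],
)
loss, logits = outputs.loss, outputs.logits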

Hope this helps.


Thanks a lot @valhalla for your reply. You’re right, I didn’t mask the tokens in the input_ids, which was my mistake.
I also found a small mistake in your code: I think the labels should be -100 everywhere except at the tokens which are masked in the input_ids. For those tokens, the label should hold the correct token id (not the mask id/103), so the model knows what the actual token is. In your code the model predicts Paterson as the correct answer, but based on the labels it thinks the correct token is actually the mask token ([MASK]).

I made a small change to the code and now it works. The loss is now 0.0056, which makes sense for a correct prediction.

import copy
from transformers import BertForMaskedLM, BertTokenizerFast
import torch
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

text = "Who was Jim Paterson ? Jim Paterson is a doctor".lower()
inputs = tokenizer.encode_plus(text, return_tensors="pt", add_special_tokens=True, truncation=True,
                               pad_to_max_length=True, return_attention_mask=True, max_length=64)
input_ids = inputs['input_ids']
labels = copy.deepcopy(input_ids)  # this is the part I changed: copy the original ids before masking
input_ids[0][7] = tokenizer.mask_token_id
labels[input_ids != tokenizer.mask_token_id] = -100 

loss, scores = model(input_ids=input_ids, attention_mask=inputs['attention_mask'], token_type_ids=inputs['token_type_ids'], labels=labels)
print('loss',loss)
pred = torch.argmax(scores[0][7]).item()
print("predicted token:", pred, tokenizer.convert_ids_to_tokens([pred]))
logSoftmax = torch.nn.LogSoftmax(dim=1)
NLLLos = torch.nn.NLLLoss()
print(NLLLos(logSoftmax(torch.unsqueeze(scores[0][7], 0)), torch.tensor([pred])))  # the same as F.cross_entropy(scores.view(-1, tokenizer.vocab_size), labels.view(-1)), since the prediction matches the label
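
For completeness, the loss can also be checked directly against the formula from the original question, i.e. the negative log-softmax of the true token’s logit at the masked position (a quick sketch reusing the scores and labels tensors from the snippet above):

# only one position has a label != -100, so the model's loss is just -log softmax
# of the true token's logit at that position
true_id = labels[0][7].item()  # the original token id at the masked position
manual_loss = -torch.log_softmax(scores[0][7], dim=-1)[true_id]
print(manual_loss)  # matches the loss returned by the model (up to rounding)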