BertForMaskedLM’s loss and scores: how is the loss computed?

I have a simple MaskedLM model with one masked token at position 7. The model returns 20.2516 and 18.0698 as the loss and score respectively. However, I'm not sure how the loss is computed from the score. I assumed the loss should be
loss = -log(softmax(score)[prediction])
but computing this loss returns 0.0002. I'm confused about how the loss is computed in the model.

import copy
from transformers import BertForMaskedLM, BertTokenizerFast
import torch
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')


text = "Who was Jim Paterson ? Jim Paterson is a doctor".lower()
inputs = tokenizer.encode_plus(text, return_tensors="pt", add_special_tokens=True, truncation=True,
                               pad_to_max_length=True, return_attention_mask=True, max_length=64)
input_ids = inputs['input_ids']
masked  = copy.deepcopy(inputs['input_ids'])
masked[0][7] = 103
for t in range(len(masked[0])):
  if masked[0][t] != 103:
    masked[0][t] = -100
loss, scores = model(input_ids = input_ids, attention_mask = inputs['attention_mask'] , token_type_ids=inputs['token_type_ids'] , labels=masked)
print('loss',loss)
print(scores.shape)
pred = torch.argmax( scores[0][7]).item()
print("predicted token:", pred, tokenizer.convert_ids_to_tokens([pred])  )
print("score:", scores[0][7][pred]) 


# manual check: negative log-likelihood of the predicted token at position 7
logSoftmax = torch.nn.LogSoftmax(dim=1)
NLLLos = torch.nn.NLLLoss()
output = NLLLos(logSoftmax(torch.unsqueeze(scores[0][7], 0)), torch.tensor([pred]))
print(output)

Hi @sanaz,

I can see a few mistakes here:

  1. You need to mask tokens in the input_ids, not in the labels. And to prepare labels for masked LM, set every position to -100 (the ignore index) except the masked positions.
  2. The masked LM loss is then calculated simply as the cross-entropy loss between the logits and the labels.

So the correct usage would be:

text = "Who was Jim Paterson ? Jim Paterson is a doctor".lower()
inputs = tokenizer([text], return_tensors="pt")

input_ids = inputs["input_ids"]
# mask the token
input_ids[0][7] = tokenizer.mask_token_id

labels = inputs["input_ids"].clone()
labels[labels != tokenizer.mask_token_id] = -100 # only calculate loss on masked tokens

loss, logits = model(
    input_ids=input_ids,
    labels=labels,
    attention_mask=inputs["attention_mask"],
    token_type_ids=inputs["token_type_ids"]
)
# loss => 18.2054

# calculate loss manually
import torch.nn.functional as F
loss2 = F.cross_entropy(logits.view(-1, tokenizer.vocab_size), labels.view(-1))
# loss2 => 18.2054

Hope this helps.


Thanks a lot @valhalla for your reply. You're right, I didn't mask the tokens in input_ids, which was a mistake.
I also found a small mistake in your code: I think the label should be -100 everywhere except at the tokens which are masked in input_ids. For those tokens, the label should hold the correct token id (and not the mask id, 103), so the model knows what the actual token is. In your code, the model predicts Paterson as the correct answer; however, based on the label, it thinks the correct token is actually the mask token ([MASK]).

I made a small change to the code and now it works. Now the loss is 0.0056, which makes sense for a correct prediction.

import copy
from transformers import BertForMaskedLM, BertTokenizerFast
import torch
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

text = "Who was Jim Paterson ? Jim Paterson is a doctor".lower()
inputs  =  tokenizer.encode_plus(text,  return_tensors="pt", add_special_tokens = True, truncation=True, pad_to_max_length = True,
                                         return_attention_mask = True,  max_length=64)
input_ids = inputs['input_ids']
labels  = copy.deepcopy(input_ids) #this is the part I changed
input_ids[0][7] = tokenizer.mask_token_id
labels[input_ids != tokenizer.mask_token_id] = -100 

loss, scores = model(input_ids = input_ids, attention_mask = inputs['attention_mask'] , token_type_ids=inputs['token_type_ids'] , labels=labels)
print('loss',loss)
pred = torch.argmax( scores[0][7]).item()
print("predicted token:", pred, tokenizer.convert_ids_to_tokens([pred])  )
print(NLLLos( logSoftmax(torch.unsqueeze(scores[0][7], 0)), torch.tensor([pred]))) #the same as F.cross_entropy(scores.view(-1, tokenizer.vocab_size), labels.view(-1))

Hello,

I am a complete beginner with NNs and I am currently facing the same problem.
I am trying to implement my own loss function for BERT masked LM.
So this part of the code is the most useful for my case:

loss2 = F.cross_entropy(logits.view(-1, tokenizer.vocab_size), labels.view(-1))

However, I do not understand how I can calculate the cross-entropy loss from the logits and the masked token ID. How do we get the information about which word was originally masked? That information seems to be completely overlooked when calling the cross-entropy function. Am I missing something?

To only calculate the loss on masked tokens, replace the labels of the non-masked tokens with -100, and the cross-entropy loss will ignore those positions.
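
For a concrete picture, here is a tiny standalone sketch (not from the thread; the numbers are made up) showing that positions labelled -100 drop out of the cross-entropy:

import torch
import torch.nn.functional as F

# Toy logits: 4 sequence positions, vocabulary of 5 tokens.
logits = torch.randn(4, 5)

# Pretend only position 2 was masked; its label is the true token id (3 here).
# Every other position gets -100 so it is ignored by the loss.
labels = torch.tensor([-100, -100, 3, -100])

loss_all = F.cross_entropy(logits, labels)               # ignore_index defaults to -100
loss_masked = F.cross_entropy(logits[2:3], labels[2:3])  # loss on the masked position only
print(loss_all, loss_masked)                             # the two values match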

Thanks for the explanations.

I’m looking at the code to further understand how the loss is calculated under the hood.

In the BERT paper it mentions:

the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary,

I’m looking at the code, and it seems to have components not mentioned in the paper.

I see that the hidden states first go through a transformation

https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py#L523

And then I see that there is an entirely separate matrix, with one row per vocabulary token, here

https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py#L547

which is multiplied by all the hidden states, plus a bias.

Which makes sense, since two separate sets of embeddings (one for the inputs, one for the labels) is how word2vec was trained.

But I didn't see these components in the paper. I'm guessing they were copied from the original TensorFlow version of BERT?

Edit:

I'm looking at the original TensorFlow version of BERT, and they use the same embedding matrix for the inputs and the labels:

https://github.com/google-research/bert/blob/master/run_pretraining.py#L141

https://github.com/google-research/bert/blob/master/modeling.py#L409

NVIDIA also uses one embedding matrix for both:

https://github.com/NVIDIA/DeepLearningExamples/blob/ad49eae34e97b533a7fefb1c5ec10bbfa5cc4256/PyTorch/LanguageModeling/BERT/modeling.py#L551
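
For reference, a minimal illustrative sketch of that head structure (dense transform, then a decoder whose weight is tied to the input embedding matrix, plus a separate output bias). The class and names below are made up for illustration and are not the actual transformers implementation:

import torch
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Illustrative MLM head: transform -> decoder tied to the input embeddings -> bias."""
    def __init__(self, hidden_size, vocab_size, input_embeddings: nn.Embedding):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        self.decoder.weight = input_embeddings.weight  # weight tying: reuse the input embeddings
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, hidden_states):
        # project hidden states onto the vocabulary and add the output bias
        return self.decoder(self.transform(hidden_states)) + self.bias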

Hi @reSearch2vec and @Maria,

BERT actually predicts all the tokens (both masked and non-masked). This is why we set the labels of the non-masked tokens to -100: it tells the model not to compute the loss for those positions, because the cross-entropy function ignores targets equal to -100, see here.
Also, you can look at this code for pre-training the BERT model to understand how the masking works.
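
To see the first point concretely, a small sketch (assuming a recent transformers version where the model returns a ModelOutput): the model produces a full distribution over the vocabulary at every position, which is why the non-masked positions must be labelled -100.

from transformers import BertForMaskedLM, BertTokenizerFast

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

inputs = tokenizer("who was jim paterson ? jim [MASK] is a doctor", return_tensors="pt")
logits = model(**inputs).logits   # shape: (1, seq_len, vocab_size)

# There is a prediction at EVERY position, not just the masked one;
# the -100 labels are what restrict the loss to the masked positions.
print(logits.shape)
print(tokenizer.convert_ids_to_tokens(logits[0].argmax(dim=-1).tolist()))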


Hi,

I am also interested in this topic.
I have a question about how to generate the "[MASK]" token for the masked position:
input_ids[0][7] = tokenizer.mask_token_id

I was wondering if there is a function that puts "[MASK]" on 15% of the tokens (more precisely: of the selected 15%, 80% are replaced with "[MASK]", 10% are replaced with random tokens, and 10% are left unchanged)?
Or do I need to write this function myself?
Thanks,
Ayala

Hi @ayalaall,

I was wondering if there is a function that generates “[MASK]” on 15% of the tokens

DataCollatorForLanguageModeling does the masking exactly as described; you can find it here.
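
For completeness, a minimal usage sketch (assuming a recent transformers version; mlm_probability controls the 15% selection, and the 80/10/10 mask/random/keep split happens inside the collator):

from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# The collator re-masks randomly on every call.
batch = collator([tokenizer("who was jim paterson ? jim paterson is a doctor")])
print(batch["input_ids"])  # some ids replaced by [MASK] / random tokens
print(batch["labels"])     # -100 everywhere except the selected positions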


Thanks! I will take a look.

Hi @valhalla, thanks for the explanation. I am also interested in this topic. Would you take a look at my question? Thanks in advance.

What @sanaz wanted to point out is that
the ground-truth label based on your code is:
[-100, -100, …, 103, …, -100]

while the ground-truth label based on sanaz's modified code is:
[-100, -100, …, token_id(7th token), …, -100]

I know that tokens labelled -100 will be ignored, but for the token to be predicted (in this case, the 7th token), do we still use '103' as the ground-truth label?

Thanks for the colab link! Would it be possible to share the dataset used in training for replicating results?

What is the formal definition of the "masked language model loss"? Is it just masking the input and predicting the masked token?

As far as I understand, yes, that is the correct interpretation of MLM. The idea is to predict the masked tokens from the unmasked ones. This is why a token is masked in the input, but in the labels the correct token id is kept at the position where it was masked in the input_ids.
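
To wrap the thread up in code, here is a minimal label-preparation sketch (assuming bert-base-uncased, a single manually chosen mask position, and a recent transformers version where the model returns a ModelOutput):

from transformers import BertForMaskedLM, BertTokenizerFast

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

inputs = tokenizer("who was jim paterson ? jim paterson is a doctor", return_tensors="pt")

labels = inputs["input_ids"].clone()                   # copy the TRUE ids before masking
inputs["input_ids"][0][7] = tokenizer.mask_token_id    # mask one token in the input only
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100  # ignore unmasked positions

outputs = model(**inputs, labels=labels)
print(outputs.loss)  # small if the model recovers the original token at position 7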