BertForMaskedLM training from scratch

Hi!
I want to train BERT from scratch using BertForMaskedLM. I was struggling with whether or not I should replace every label with -100, apart from the positions that hold the [MASK] token.
Later I found this post: BertForMaskedLM train - :hugs:Transformers - Hugging Face Forums, and I thought everything was clear: I should replace the label of every non-masked token with -100.
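
Just to make sure we are talking about the same thing, here is a minimal sketch of what I understood from that post (my own code, not the author's): labels start as a copy of the original input_ids, the sampled positions get [MASK] in the inputs, and every other label is set to -100 so the loss is only computed on the masked tokens.

import numpy as np
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer("The capital of France is Paris.", return_tensors="np")

input_ids = enc["input_ids"].copy()
labels = enc["input_ids"].copy()

is_masked = np.zeros_like(input_ids, dtype=bool)
is_masked[0, 6] = True                            # pretend this position was sampled for masking

input_ids[is_masked] = tokenizer.mask_token_id    # the model sees [MASK] here
labels[~is_masked] = -100                         # all non-masked labels are ignored by the loss
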
BUT I also found a Google Colab notebook (Masked_Language_Modeling_(MLM)+_Fine_Tuning_for_Text_Classification_with_BERT.ipynb - Colaboratory (google.com)) with this function:

def create_masked_lm_predictions(token_ids, masked_lm_prob=0.15, max_predictions_per_seq=10):  
  masked_token_ids = copy.deepcopy(token_ids)
  masked_token_labels = copy.deepcopy(token_ids)
  masked_token_labels[masked_token_labels==101] = -100
  masked_token_labels[masked_token_labels==102] = -100
  for ind in range(len(token_ids)):
    len_tokens = (tf.math.count_nonzero(token_ids[ind]).numpy())
    cand_indices = [i for i in range(len_tokens) if token_ids[ind][i] not in [102, 101]]
    num_to_predict = min(max_predictions_per_seq, max(1, int(round(len_tokens * masked_lm_prob))))
    masked_lms = []
    random.shuffle(cand_indices) 
    for index_token in cand_indices:
      if len(masked_lms) >= num_to_predict:
        break
      masked_token = None
      if random.random() < masked_lm_prob:
        #80% of time replace with mask
        if random.random() < 0.8: 
          masked_token = tokenizer.convert_tokens_to_ids("[MASK]")
        else:
          #10% of the time keep original
          if random.random() < 0.5: 
            masked_token = token_ids[ind][index_token]
          else: 
            #10% of the time replace with random
            masked_token = random.randint(999, tokenizer.vocab_size-1) 
      if masked_token != None:
        masked_token_ids[ind][index_token] = masked_token
        masked_lms.append(masked_token) 
      else:
        #Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] 
        #Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
        masked_token_labels[ind][index_token] = -100 

  return masked_token_ids, masked_token_labels
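
(For completeness, this is how I call that cell on my side; it is my own hypothetical usage, not part of the notebook. I assume token_ids is a padded NumPy array of input_ids, since the function assigns through boolean indexing, and that copy, random, tensorflow and the tokenizer are already set up in earlier cells.)

import copy
import random

import tensorflow as tf
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer(
    ["I want to train BERT from scratch.",
     "Masked language modeling hides some of the tokens."],
    padding="max_length", truncation=True, max_length=32,
    return_tensors="np",
)

masked_ids, labels = create_masked_lm_predictions(enc["input_ids"])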

Everything looked fine at first sight: we mask tokens according to the given probability, and the labels of all the other tokens are replaced with -100. But let’s analyze this part of the code:

if len(masked_lms) >= num_to_predict:
  break

For each sequence, we first compute the number of tokens to predict (num_to_predict). Let’s say it comes out to 1. If we replace a token with [MASK] in the first iteration of the inner loop, the loop breaks (because len(masked_lms) == num_to_predict == 1), and the labels of all the remaining candidate tokens are never set to -100! They keep their original token ids, so the loss is also computed on tokens that were never masked.
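
To make the expected behaviour concrete, this is roughly what I would do instead (my own sketch, assuming the standard -100 convention; build_mlm_labels and selected_positions are just hypothetical names): initialize every label to -100 and only restore the original ids at the positions actually selected for prediction, so an early break can never leave unmasked tokens in the loss.

import numpy as np

def build_mlm_labels(token_ids, selected_positions):
  """token_ids: (batch, seq_len) array of original ids.
  selected_positions: iterable of (row, col) pairs chosen for prediction."""
  labels = np.full_like(token_ids, -100)      # ignore every position by default
  for row, col in selected_positions:
    labels[row, col] = token_ids[row, col]    # compute the loss only here
  return labels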

Please, can you tell me if I am right, or am I missing something? Any advice would be really appreciated. Thanks :slight_smile: