Hi!
I want to train BERT from scratch using BertForMaskedLM. I was struggling with whether or not I should replace every label with -100 except for the positions of the [MASK] tokens.
Later I found this post: BertForMaskedLM train - Transformers - Hugging Face Forums, and I thought everything was clear to me: I should replace every non-masked token with -100 in the labels.
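Just to make sure we are talking about the same thing, here is a tiny hand-written sketch of what I understood from that post (PyTorch tensors; all ids other than 101/102/103, which are [CLS]/[SEP]/[MASK] for bert-base-uncased, are made-up placeholders):

import torch

# One already-masked sequence and the same sequence before masking
input_ids = torch.tensor([[101, 7592, 103, 2088, 102]])
original_ids = torch.tensor([[101, 7592, 1999, 2088, 102]])

# Labels: copy the original ids, then ignore every position that is not [MASK]
labels = original_ids.clone()
labels[input_ids != 103] = -100
print(labels)  # tensor([[-100, -100, 1999, -100, -100]])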
BUT I also found a Google Colab notebook ( Masked_Language_Modeling_(MLM)+_Fine_Tuning_for_Text_Classification_with_BERT.ipynb - Colaboratory (google.com) ) with this function:
def create_masked_lm_predictions(token_ids, masked_lm_prob=0.15, max_predictions_per_seq=10):
    masked_token_ids = copy.deepcopy(token_ids)
    masked_token_labels = copy.deepcopy(token_ids)
    masked_token_labels[masked_token_labels == 101] = -100
    masked_token_labels[masked_token_labels == 102] = -100
    for ind in range(len(token_ids)):
        len_tokens = tf.math.count_nonzero(token_ids[ind]).numpy()
        cand_indices = [i for i in range(len_tokens) if token_ids[ind][i] not in [102, 101]]
        num_to_predict = min(max_predictions_per_seq, max(1, int(round(len_tokens * masked_lm_prob))))
        masked_lms = []
        random.shuffle(cand_indices)
        for index_token in cand_indices:
            if len(masked_lms) >= num_to_predict:
                break
            masked_token = None
            if random.random() < masked_lm_prob:
                # 80% of the time replace with [MASK]
                if random.random() < 0.8:
                    masked_token = tokenizer.convert_tokens_to_ids("[MASK]")
                else:
                    # 10% of the time keep the original token
                    if random.random() < 0.5:
                        masked_token = token_ids[ind][index_token]
                    else:
                        # 10% of the time replace with a random token
                        masked_token = random.randint(999, tokenizer.vocab_size - 1)
            if masked_token != None:
                masked_token_ids[ind][index_token] = masked_token
                masked_lms.append(masked_token)
            else:
                # Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size]
                # Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
                masked_token_labels[ind][index_token] = -100
    return masked_token_ids, masked_token_labels
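For context, this is roughly how I was calling it while testing (my own snippet, not from the notebook; it assumes bert-base-uncased and NumPy arrays so the boolean indexing at the top of the function works):

import copy
import random
import tensorflow as tf  # the notebook already imports these
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["the quick brown fox jumps over the lazy dog", "a short sentence"],
    padding="max_length",
    max_length=16,
    return_tensors="np",
)
masked_ids, labels = create_masked_lm_predictions(batch["input_ids"])
print(masked_ids[0])
print(labels[0])  # I expected every non-masked position here to be -100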
Everything looked fine at first sight: we mask tokens according to a given probability, and all of the other tokens get -100 in the labels. But let's analyze this part of the code:
if len(masked_lms) >= num_to_predict:
    break
At the beginning of the loop body we compute num_to_predict, the number of tokens to replace. Let's say it comes out as 1. If we replace a token with [MASK] in the first iteration of the inner loop, the loop breaks on the next iteration (because len(masked_lms) == num_to_predict == 1), and all of the remaining tokens never get their labels replaced with -100!
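If I am right, the workaround I had in mind (just my own idea, not something from the notebook; build_labels and masked_positions are names I made up for the sketch) is to build the labels the other way around: start from all -100 and write the original id back only at the positions that actually get masked, something like:

import numpy as np

def build_labels(token_ids, masked_positions):
    """token_ids: int array [batch, seq_len]; masked_positions: list of (row, col) pairs that were masked."""
    labels = np.full_like(token_ids, -100)       # every position ignored by default
    for row, col in masked_positions:
        labels[row, col] = token_ids[row, col]   # compute the loss only on the masked tokens
    return labels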
Please, can you tell me if I am right, or am I missing something? Any advice would be really appreciated. Thanks!