BertForMaskedLM train

I have a question
When training with BertForMaskedLM, is the training data below correct?

  • token2idx
<pad> : 0, <mask>: 1, <cls>:2, <sep>:3
  • max len : 8
  • input token
 <cls> hello i <mask> cats <sep>
  • input ids
 [2, 34, 45, 1, 56, 3, 0, 0]
  • attention_mask
 [1,1,1,1,1,1,0,0]
  • labels
 [-100,-100,-100,64,-100,-100,-100,-100]

I wonder if I should also assign -100 to the labels for the padding tokens.


Hi,
Were you able to figure it out? I’m also trying to do the same thing.

Thanks,
Ayala

You should replace all tokens (including padding) in labels with -100 except the masked tokens, so the loss is only calculated for the masked tokens.
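A minimal sketch of that label construction, using the toy vocabulary from the question (the id 64 for the original word behind `<mask>` is taken from the example labels; the variable names are my own):

```python
# Build MLM labels: every position except the masked ones is set to -100,
# which is the ignore_index of PyTorch's cross-entropy loss, so padding,
# special tokens, and unmasked tokens contribute nothing to the loss.

MASK_ID = 1     # <mask> in the example token2idx
IGNORE = -100   # ignored by the loss

input_ids = [2, 34, 45, 1, 56, 3, 0, 0]       # <cls> hello i <mask> cats <sep> <pad> <pad>
original_ids = [2, 34, 45, 64, 56, 3, 0, 0]   # ids before masking (64 = the masked word)

labels = [
    orig if inp == MASK_ID else IGNORE
    for inp, orig in zip(input_ids, original_ids)
]
print(labels)  # [-100, -100, -100, 64, -100, -100, -100, -100]
```

In practice, `DataCollatorForLanguageModeling` from the `transformers` library does this masking and label construction for you.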