I have a question.
When training with BertForMaskedLM, is the training data below correct?
- token2idx
<pad>: 0, <mask>: 1, <cls>: 2, <sep>: 3
- max_len: 8
- input token
<cls> hello i <mask> cats <sep>
- input ids
[2, 34, 45, 1, 56, 3, 0, 0]
- attention_mask
[1, 1, 1, 1, 1, 1, 0, 0]
- labels
[-100, -100, -100, 64, -100, -100, -100, -100]
In particular, I wonder whether padding tokens should also be assigned -100 in labels, as I did above.
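For context, here is a minimal sketch of how I build these labels in plain Python, assuming the toy vocabulary above and that the masked word's original id is 64 (that id is hypothetical, just matching the example):

```python
# Minimal sketch of MLM label construction for the example above.
# Token ids (<pad>=0, <mask>=1, <cls>=2, <sep>=3) are the toy ids from this post.
PAD_ID, MASK_ID = 0, 1
IGNORE = -100  # -100 is ignored by PyTorch's CrossEntropyLoss (its default ignore_index)

original_ids = [2, 34, 45, 64, 56, 3, 0, 0]  # sequence before masking; 64 is the masked word (assumed)
input_ids    = [2, 34, 45,  1, 56, 3, 0, 0]  # <mask> replaces token 64

# label = original token id at masked positions, -100 everywhere else
# (including the padding positions)
labels = [orig if inp == MASK_ID else IGNORE
          for inp, orig in zip(input_ids, original_ids)]

# attention_mask: 1 for real tokens, 0 for padding
attention_mask = [0 if i == PAD_ID else 1 for i in input_ids]

print(labels)          # [-100, -100, -100, 64, -100, -100, -100, -100]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

This reproduces exactly the `labels` and `attention_mask` arrays in my example, so my question is whether this is the intended way to handle padding.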