Self-pretrained model predicts tokens with a -1 index gap

Hi.

I’ve pretrained a masked language model with “TFBertLMHeadModel”. I use the TF-based implementation because my training environment is a TPU on Google Colab (see the Environment section below).

My issue is that the pretrained model predicts tokens with a -1 index gap relative to the input token sequence, i.e. the prediction for the first token is missing from the predicted sequence. I’d like to know what causes this.

Let me show this concretely. My input token sequence (Japanese) is ['[CLS]', 'テレビ', 'で', 'サッカー', 'の', '試合', 'を', '見る', '。', '[SEP]'], which means “I watch a football game on TV”. The corresponding token ids are tf.Tensor([[ 2 12471 878 14177 885 13325 921 12182 809 3]], shape=(1, 10), dtype=int32), where token id 2 is [CLS] and token id 3 is [SEP].
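
For reference, here is roughly how I produce the predictions shown below; the tokenizer and model paths are placeholders for my own vocabulary and checkpoint.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFBertLMHeadModel

# Placeholders: my own tokenizer and checkpoint, not public models.
tokenizer = AutoTokenizer.from_pretrained("my-japanese-tokenizer")
model = TFBertLMHeadModel.from_pretrained("my-pretrained-bert")

inputs = tokenizer("テレビでサッカーの試合を見る。", return_tensors="tf")
logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Top-5 candidate ids for every index position.
top5 = tf.math.top_k(logits[0], k=5).indices.numpy()
for i, ids in enumerate(top5):
    print(f"token-index={i}, top5 token ids={ids}, "
          f"top5 tokens(JP)={tokenizer.convert_ids_to_tokens(ids.tolist())}")
```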

When I get token predictions from my pretrained model, the output is the following.

The output shows the top-5 tokens for each index position:

token-index=0, top5 token ids=[12471 12538 11941 12321 12207], top5 tokens(JP)=['テレビ', '英語', '日本', 'ゲーム', 'ネット']
token-index=1, top5 token ids=[  878 12876   871   907 12289], top5 tokens(JP)=['で', 'アニメ', 'だ', 'や', 'にて']
token-index=2, top5 token ids=[14177 12676 12330 14396  1354], top5 tokens(JP)=['サッカー', '音楽', '映画', '野球', '初']
token-index=3, top5 token ids=[  885   851 11893   886  1016], top5 tokens(JP)=['の', 'が', 'その', 'は', '・']
token-index=4, top5 token ids=[13325 12599  4651 12127 12833], top5 tokens(JP)=['試合', '続き', '話', '結果', '様子']
token-index=5, top5 token ids=[  921   878 11886   905   851], top5 tokens(JP)=['を', 'で', 'から', 'も', 'が']
token-index=6, top5 token ids=[12182 12567 12192 13179 13552], top5 tokens(JP)=['見る', '読む', 'みる', '知る', '探す']
token-index=7, top5 token ids=[  809    30 11881   819    41], top5 tokens(JP)=['。', '!', 'こと', '」', '.']
token-index=8, top5 token ids=[809 819  36  30 455], top5 tokens(JP)=['。', '」', ')', '!', '”']
token-index=9, top5 token ids=[  809    30   808 12182   484], top5 tokens(JP)=['。', '!', '、', '見る', '→']

The predicted output is missing the token that corresponds to “[CLS]”, so the input-output token-index correspondence has a -1 gap. Instead of a prediction for the first token, my model appends an extra prediction at the end ([ 809 30 808 12182 484] at index 9).

When I feed the same input to bert-base-japanese, which is officially provided on the Hugging Face Hub, the predicted output does have a prediction corresponding to “[CLS]”, namely [ 6175, 2867, 15591, 14847, 11453] at index 0, which describes the sentence topic. Here, the input sequence is tensor([[ 2, 571, 12, 1301, 5, 608, 11, 2867, 8, 3]]).

token-index=0, top5 token ids=tensor([ 6175,  2867, 15591, 14847, 11453]), top5 tokens(JP)=['趣味', '見る', '楽しみ', '観戦', '楽しむ']
token-index=1, top5 token ids=tensor([ 571, 9921, 1301, 1584,  792]), top5 tokens(JP)=['テレビ', 'テレビ局', 'サッカー', 'ラジオ', 'バス']
token-index=2, top5 token ids=tensor([ 12,   9,  13, 128,  40]), top5 tokens(JP)=['で', 'は', 'と', '生', 'から']
token-index=3, top5 token ids=tensor([ 1301,  9485,  1784,  6856, 19395]), top5 tokens(JP)=['サッカー', 'フットボール', 'スポーツ', 'バレーボール', 'フットサル']
token-index=4, top5 token ids=tensor([   5, 3801,   35, 1611,  534]), top5 tokens(JP)=['の', 'ワールドカップ', '・', '競技', 'ノ']
token-index=5, top5 token ids=tensor([ 608, 2012,  733, 2626,  735]), top5 tokens(JP)=['試合', 'プレー', 'ゲーム', 'ニュース', '話']
token-index=6, top5 token ids=tensor([ 11, 362,  28,   9,  14]), top5 tokens(JP)=['を', 'について', 'も', 'は', 'が']
token-index=7, top5 token ids=tensor([ 2867,  8282, 10317,  9010, 17122]), top5 tokens(JP)=['見る', 'みる', '聞く', '読む', '聴く']
token-index=8, top5 token ids=tensor([  8,  45, 258,  13,   6]), top5 tokens(JP)=['。', 'こと', ')。', 'と', '、']
token-index=9, top5 token ids=tensor([  8,  23,   6, 375,  40]), top5 tokens(JP)=['。', '(', '、', '他', 'から']
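
For reference, the comparison above uses the PyTorch version of the model (hence the tensor(...) output). A minimal sketch, assuming the cl-tohoku/bert-base-japanese checkpoint:

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM

# Assumption: this is the official BERT-base Japanese checkpoint on the Hub.
name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(name)
model = BertForMaskedLM.from_pretrained(name)

inputs = tokenizer("テレビでサッカーの試合を見る。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

top5 = logits[0].topk(5).indices
for i, ids in enumerate(top5):
    print(f"token-index={i}, top5 token ids={ids}, "
          f"top5 tokens(JP)={tokenizer.convert_ids_to_tokens(ids.tolist())}")
```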

Why is the index-0 token missing from my model’s prediction?

Below I describe my training environment and data generation process.

Training condition

The block_size used for grouping is 64 or 512; both were tested.
The BERT training parameters are the defaults of BertConfig, as sketched below.
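
A minimal sketch of the model construction; the vocab_size is a placeholder for my tokenizer’s actual vocabulary size:

```python
from transformers import BertConfig, TFBertLMHeadModel

# Default BERT hyperparameters; vocab_size is a placeholder for my tokenizer's size.
config = BertConfig(vocab_size=32000)
model = TFBertLMHeadModel(config)
```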

Data generation process

  1. tokenize
  2. masking
  3. grouping
  4. convert to .tfrecord
  5. save to GCS
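
A simplified sketch of steps 2-4, under these assumptions: the [MASK] id is a placeholder for my vocabulary, the masking is reduced to the plain always-[MASK] case (my real pipeline follows the usual 80/10/10 rule), and the order of masking and grouping is simplified:

```python
import tensorflow as tf

BLOCK_SIZE = 64   # or 512
MASK_ID = 4       # placeholder: my tokenizer's [MASK] id
MLM_PROB = 0.15

def mask_tokens(ids):
    """Step 2 (simplified): pick 15% of tokens, keep their ids as labels, replace with [MASK]."""
    ids = tf.constant(ids, dtype=tf.int64)
    picked = tf.random.uniform(tf.shape(ids)) < MLM_PROB
    labels = tf.where(picked, ids, tf.constant(-100, tf.int64))  # -100 is ignored by the loss
    inputs = tf.where(picked, tf.constant(MASK_ID, tf.int64), ids)
    return inputs, labels

def group(ids, block_size=BLOCK_SIZE):
    """Step 3: concatenated token ids are cut into fixed-size blocks; the tail is dropped."""
    n = (len(ids) // block_size) * block_size
    return [ids[i:i + block_size] for i in range(0, n, block_size)]

def write_tfrecord(blocks, path):
    """Step 4: serialize (input_ids, labels) pairs into a .tfrecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for block in blocks:
            inputs, labels = mask_tokens(block)
            features = tf.train.Features(feature={
                "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=inputs.numpy())),
                "labels": tf.train.Feature(int64_list=tf.train.Int64List(value=labels.numpy())),
            })
            writer.write(tf.train.Example(features=features).SerializeToString())
```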

Environment

TPU (and CPU) on Google Colab.
Data are stored on Google Cloud Storage in “.tfrecord” format.

  • tensorflow 2.6.0
  • transformers 4.15.0
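
The TPU is initialized with the standard Colab setup before the model is built:

```python
import tensorflow as tf

# Standard TPU initialization on Google Colab (TF 2.6).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    pass  # the TFBertLMHeadModel is created and compiled here
```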

What I’ve tried so far

I confirmed that

    • my training data has [CLS] at index 0 and [SEP] at the last index (checked as shown below);
    • both training loss and validation loss decrease over training iterations.
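
For the first point, I check the stored records like this (the bucket path is a placeholder; the feature names follow the pipeline sketch above):

```python
import tensorflow as tf

features = {
    "input_ids": tf.io.FixedLenFeature([64], tf.int64),  # block_size = 64 case
    "labels": tf.io.FixedLenFeature([64], tf.int64),
}

dataset = tf.data.TFRecordDataset("gs://my-bucket/train.tfrecord")  # placeholder path
for raw in dataset.take(5):
    example = tf.io.parse_single_example(raw, features)
    ids = example["input_ids"].numpy()
    assert ids[0] == 2 and ids[-1] == 3  # [CLS] = 2 at index 0, [SEP] = 3 at the end
```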

Thank you!