[HELP] How to fix IndexError: index out of range in self

Hi everyone,

I keep running into a problem when training a language model from scratch, following this tutorial: notebooks/language_modeling_from_scratch.ipynb at master · huggingface/notebooks · GitHub

I trained a WordPiece tokenizer (BERT-style), added my special tokens, and saved it successfully. I now want to use it to train my language model from scratch.

Now, when I run the following code to train my model:

from transformers import DataCollatorForLanguageModeling, Trainer

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    data_collator=data_collator,
)
trainer.train()

This is the error I get, and I do not understand why it is happening:

***** Running training *****
  Num examples = 5660
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2124


IndexError                                Traceback (most recent call last)

<ipython-input-28-3435b262f1ae> in <module>()
----> 1 trainer.train()

11 frames

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2041         # remove once script supports set_grad_enabled
   2042         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self
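To narrow it down, I reproduced the failing call in isolation. If I understand the traceback correctly, torch.nn.Embedding raises this exact IndexError whenever it is given a token id greater than or equal to its num_embeddings. A minimal sketch with toy sizes (not my actual model):

```python
import torch
import torch.nn as nn

# Minimal reproduction of the failure mode: an embedding layer raises
# "IndexError: index out of range in self" when it receives a token id
# >= its num_embeddings. (Toy sizes here, not my actual model.)
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

print(emb(torch.tensor([0, 9])).shape)  # ids in range -> torch.Size([2, 4])

try:
    emb(torch.tensor([10]))  # id out of range -> the same IndexError as above
except IndexError as err:
    print(err)
```

So my guess is that my tokenizer produces ids outside the model's embedding table, e.g. if `len(tokenizer)` is larger than `model.config.vocab_size`. I plan to compare the two and, if they differ, call `model.resize_token_embeddings(len(tokenizer))` before training, but I am not certain this is the cause here.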