Adding New Tokens - IndexError: index out of range in self

Hi,

I have added custom tokens using this code:

# Let's see how to increase the vocabulary of the BERT model and tokenizer
from transformers import AutoModelForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

# add_tokens returns the number of tokens that were actually added
num_added_toks = tokenizer.add_tokens(['😎', '🤬'])
print('We have added', num_added_toks, 'tokens')

# Note: resize_token_embeddings expects the full size of the new
# vocabulary, i.e. the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
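
As a quick sanity check (a minimal sketch — the exact ids depend on the base vocabulary), the new tokens show up at the end of the resized vocabulary:

# bert-base-uncased has 30522 tokens, so after adding two we expect 30524
print(len(tokenizer))
# the new tokens should map to the last two ids, e.g. [30522, 30523]
print(tokenizer.convert_tokens_to_ids(['😎', '🤬']))
# '😎' should now survive tokenization as a single token
print(tokenizer.tokenize('I feel 😎 today'))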

However, when I execute this code:

trainer.train()

I get this error:

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-42-3435b262f1ae> in <module>()
----> 1 trainer.train()

11 frames

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1914         # remove once script supports set_grad_enabled
   1915         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1917 
   1918 

IndexError: index out of range in self

hey @anon58275033, without seeing the full code it’s a bit hard to debug, but my first question is: did you tokenize the corpus after adding the new tokens? and if so, did you check whether the tokenization is working as expected?
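
for context, this particular IndexError is what torch.nn.Embedding raises when it’s fed a token id at or beyond its number of rows — which is what happens if the corpus was tokenized with the enlarged tokenizer but the embeddings weren’t resized (or a stale pre-tokenized dataset is being reused). a minimal sketch reproducing it (the sizes here assume bert-base-uncased's original vocabulary of 30522):

import torch

# embedding matrix with the *old* vocabulary size
emb = torch.nn.Embedding(num_embeddings=30522, embedding_dim=768)
# id of a newly added token, beyond the old vocabulary
ids = torch.tensor([30523])
emb(ids)  # raises IndexError: index out of range in self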

Hello @lewtun, how do I tokenize the corpus after adding the new tokens? Also, this may be connected to another issue I am having: [HELP] How to include emojis in masked language modelling?

Please stop spamming every topic with this other topic. If no one has answered it, you should make sure it contains all the information other people need to help you properly.

hey @anon58275033, you can tokenize the corpus in the usual way after you’ve added new tokens with tokenizer.add_tokens. since it seems you’re doing masked language modeling, you might want to check out this tutorial to see how it’s done: Google Colaboratory
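
for illustration, a minimal sketch of that step (it assumes a datasets.Dataset called raw_dataset with a 'text' column — those names are placeholders, adapt them to your setup):

from transformers import DataCollatorForLanguageModeling

def tokenize_function(examples):
    # uses the tokenizer *after* the new tokens were added
    return tokenizer(examples['text'], truncation=True, max_length=128)

tokenized_dataset = raw_dataset.map(
    tokenize_function, batched=True, remove_columns=['text']
)

# the collator handles the random masking for MLM
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15
)

the key point is that tokenization has to happen after add_tokens, so the new ids line up with the resized embedding matrix.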

Hi, I have checked out that tutorial. Is it possible to add the emojis to the vocabulary somehow?