Adding New Tokens - IndexError: index out of range in self

Hi,

I have added custom tokens using this code:

# Let's see how to increase the vocabulary of the BERT model and tokenizer
from transformers import AutoModelForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

# add_tokens returns the number of tokens that were actually added
num_added_toks = tokenizer.add_tokens(['😎', '🤬'])
print('We have added', num_added_toks, 'tokens')

# Note: resize_token_embeddings expects the full size of the new
# vocabulary, i.e. the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
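
As a quick sanity check (a minimal sketch — the exact ids depend on the base vocabulary), the new tokens show up at the end of the resized vocabulary:

# bert-base-uncased has 30522 tokens, so after adding two we expect 30524
print(len(tokenizer))
# the new tokens should map to the last two ids, e.g. [30522, 30523]
print(tokenizer.convert_tokens_to_ids(['😎', '🤬']))
# '😎' should now survive tokenization as a single token
print(tokenizer.tokenize('I feel 😎 today'))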

However, when I execute this code:

trainer.train()

I get this error:

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-42-3435b262f1ae> in <module>()
----> 1 trainer.train()

11 frames

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1914         # remove once script supports set_grad_enabled
   1915         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1917 
   1918 

IndexError: index out of range in self

hey @anon58275033, without seeing the full code it’s a bit hard to debug, but my first question is: did you tokenize the corpus after adding the new tokens? and if so, did you check whether the tokenization is working as expected?
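
for context, this particular IndexError is what torch.nn.Embedding raises when it’s fed a token id at or beyond its number of rows — which is what happens if the corpus was tokenized with the enlarged tokenizer but the embeddings weren’t resized (or a stale pre-tokenized dataset is being reused). a minimal sketch reproducing it (the sizes here assume bert-base-uncased's original vocabulary of 30522):

import torch

# embedding matrix with the *old* vocabulary size
emb = torch.nn.Embedding(num_embeddings=30522, embedding_dim=768)
# id of a newly added token, beyond the old vocabulary
ids = torch.tensor([30523])
emb(ids)  # raises IndexError: index out of range in self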

Hello @lewtun, how do I tokenize the corpus after adding the new tokens? Also, this may be connected to another issue I am having: [HELP] How to include emojis in masked language modelling?

Please stop spamming every topic with this other topic. If no one has answered it, you should make sure it contains all the information other people need to help you properly.

hey @anon58275033, you can tokenize the corpus in the usual way after you’ve added new tokens with tokenizer.add_tokens. since it seems you’re doing masked language modeling, you might want to check out this tutorial to see how it’s done: Google Colaboratory
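
for illustration, a minimal sketch of that step (it assumes a datasets.Dataset called raw_dataset with a 'text' column — those names are placeholders, adapt them to your setup):

from transformers import DataCollatorForLanguageModeling

def tokenize_function(examples):
    # uses the tokenizer *after* the new tokens were added
    return tokenizer(examples['text'], truncation=True, max_length=128)

tokenized_dataset = raw_dataset.map(
    tokenize_function, batched=True, remove_columns=['text']
)

# the collator handles the random masking for MLM
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15
)

the key point is that tokenization has to happen after add_tokens, so the new ids line up with the resized embedding matrix.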

Hi, I have checked out that tutorial. Is it possible to add the emojis to the vocabulary somehow?