"IndexError: index out of range in self" for bert LM example on https://huggingface.co/transformers/quickstart.html

Kwiebes1995 · October 29, 2020, 10:58am

Hi!

I was trying to use my own data for the language model example (BERT) mentioned here: https://huggingface.co/transformers/quickstart.html

However, I get "IndexError: index out of range in self" when I use my own data. At first I thought it was related to the sequence length, but I also get the error for sequences shorter than 512 tokens.

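The code is:

# Imports and tokenizer as in the quickstart example
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')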

tokenized_text = ['[CLS]', '#', '#', 'steps', '[SEP]', '1', '.', 'if', 'the', 'area', 'is', 'hot', 'or', 'in', '##fl', '##ame', '##d', 'after', 'your', 'laser', 'tattoo', 'removal', 'session', 'you', 'can', 'apply', 'an', 'ice', 'pack', 'wrapped', 'in', 'a', 'damp', 'cloth', '.', '[SEP]', '2', '.', 'over', 'the', 'counter', 'pain', 'relief', 'such', 'as', 'para', '##ce', '##tam', '##ol', 'can', 'help', 'by', 'reducing', 'any', 'temporary', 'pain', '.', '[SEP]', '3', '.', 'el', '##eva', '##te', 'the', 'area', 'is', 'its', 'an', 'ex', '##tre', '##mity', 'such', 'as', 'a', 'wrist', 'or', 'ankle', 'to', 'reduce', 'swelling', '.', '[SEP]', 'keep', 'the', 'tattoo', 'site', 'clean', 'and', 'dry', 'and', 'avoid', 'soaking', '[MASK]', 'in', 'the', 'first', 'week', 'or', 'two', 'during', 'the', 'healing', 'stage', '.', '[SEP]']
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
index_of_masked_token = tokenized_text.index('[MASK]')
# Make the segments_ids: the counter increases after every [SEP]
counter = 0
segments_ids = []
for token in tokenized_text:
    segments_ids.append(counter)
    if token == '[SEP]':
        counter += 1
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
print("size of tokens in tensor {0}".format(tokens_tensor.shape))
print("size of segment tokens in tensor {0}".format(segments_tensors.shape))

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Predict hidden states features for each layer
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    encoded_layers = outputs[0]
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)

# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()


# Predict all tokens
with torch.no_grad():
    # error is caused by the line below.
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

The error is:

      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/transformers/modeling_bert.py", line 752, in forward
    embedding_output = self.embeddings(
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/transformers/modeling_bert.py", line 180, in forward
    token_type_embeddings = self.token_type_embeddings(token_type_ids)
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 124, in forward
    return F.embedding(
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    IndexError: index out of range in self

Does anyone know why I get this IndexError?

BramVanroy · October 29, 2020, 11:06am

You are trying to process multiple sentences separated by SEP tokens. That does not make sense for BERT, and it is not clear what you are trying to do.

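The index error is raised because BERT was pretrained with only two segment IDs, 0 and 1. Those were needed for the NSP objective. Your loop increments the segment ID after every SEP token, so it produces IDs of 2 and higher, which the token type embedding layer (the last transformers frame in your traceback) cannot look up. It therefore does not make sense to add more segments/segment IDs, and that simply won’t work, as you found out. Perhaps you should start by explaining what you are trying to do.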
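To make the point above concrete, here is a minimal plain-Python sketch that reproduces the segment ID loop from the question on a short, made-up token list and reports which IDs fall outside the allowed range. The helper names are hypothetical, and TYPE_VOCAB_SIZE is hard-coded to 2 on the assumption that it mirrors model.config.type_vocab_size for bert-base-uncased; it is a diagnostic sketch, not part of the library.

```python
# Assumption: bert-base-uncased has a token-type embedding table with 2 rows
# (model.config.type_vocab_size); it is hard-coded here so that no model
# download is needed.
TYPE_VOCAB_SIZE = 2


def segment_ids_like_original(tokens):
    """Reproduce the loop from the question: bump the id after every [SEP]."""
    ids, counter = [], 0
    for token in tokens:
        ids.append(counter)
        if token == '[SEP]':
            counter += 1
    return ids


def out_of_range_ids(segment_ids, type_vocab_size=TYPE_VOCAB_SIZE):
    """Return the sorted distinct ids that the embedding table cannot look up."""
    return sorted({i for i in segment_ids if not 0 <= i < type_vocab_size})


# Made-up example with four [SEP]-terminated chunks
tokens = ['[CLS]', 'a', '[SEP]', 'b', '[SEP]', 'c', '[SEP]', 'd', '[SEP]']
ids = segment_ids_like_original(tokens)
bad = out_of_range_ids(ids)
print(ids)  # [0, 0, 0, 1, 1, 2, 2, 3, 3]
print(bad)  # [2, 3]
```

Running the same check on the segments_ids from the question before calling the model should flag every ID of 2 or more, which is where the embedding lookup fails.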

I highly recommend that you read the original BERT paper to better understand this.

rgwatwormhill · October 29, 2020, 11:51am

Hi @Kwiebes1995,

You don’t need to put a SEP token between each sentence. The SEP token is intended particularly for tasks like Next Sentence Prediction, where you are predicting whether a second sentence logically follows a first sentence.
In some cases, you will want to use only one text, which can be made up of many sentences. If so, you would have a SEP token only at the very end of the text. In that case, I think the segment_ids would all be 0, and I think you wouldn’t need to define them at all, because token_type_ids is an optional parameter.
Since you have only used one MASK token in the whole text, I assume you are treating the whole text as one input. Is that correct? If so, then I don’t think you should separate the text into sentences.
You can still make a tensor out of a single input text (if the model is expecting a tensor): it will just have a dimension of 1 in that direction.
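As a rough illustration of the single-text approach described above, the sketch below merges an already tokenized list into one segment: it drops the intermediate [SEP] tokens, keeps one at the end, and assigns segment ID 0 to every position. The function name as_single_segment and the short token list are made up for the example, and it assumes the tokens start with [CLS] as in the question.

```python
def as_single_segment(tokens):
    """Drop every [SEP], append a single one at the end, use segment id 0 throughout."""
    body = [t for t in tokens if t != '[SEP]']
    merged = body + ['[SEP]']
    return merged, [0] * len(merged)


# Made-up, shortened version of the token list from the question
tokens = ['[CLS]', 'keep', 'it', 'dry', '.', '[SEP]',
          'avoid', '[MASK]', '.', '[SEP]']
merged, segment_ids = as_single_segment(tokens)
masked_index = merged.index('[MASK]')
print(merged)
print(segment_ids)
```

The merged list can then go through tokenizer.convert_tokens_to_ids and torch.tensor exactly as in the question. Since BERT fills in zeros when token_type_ids is not passed, the all-zero segment_ids can also simply be left out.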

[I might have misunderstood what you are doing. I’ve only used BertForSequenceClassification.]
