"IndexError: index out of range in self" for bert LM example on https://huggingface.co/transformers/quickstart.html

Hi!

I was trying to use my own data for the language model example (BERT) mentioned here: https://huggingface.co/transformers/quickstart.html

However, I get an "IndexError: index out of range in self" when I use my own data. At first I thought it was related to the sequence length, but I also get the error for sequences shorter than 512 tokens.

The code is:

import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenized_text = ['[CLS]', '#', '#', 'steps', '[SEP]', '1', '.', 'if', 'the', 'area', 'is', 'hot', 'or', 'in', '##fl', '##ame', '##d', 'after', 'your', 'laser', 'tattoo', 'removal', 'session', 'you', 'can', 'apply', 'an', 'ice', 'pack', 'wrapped', 'in', 'a', 'damp', 'cloth', '.', '[SEP]', '2', '.', 'over', 'the', 'counter', 'pain', 'relief', 'such', 'as', 'para', '##ce', '##tam', '##ol', 'can', 'help', 'by', 'reducing', 'any', 'temporary', 'pain', '.', '[SEP]', '3', '.', 'el', '##eva', '##te', 'the', 'area', 'is', 'its', 'an', 'ex', '##tre', '##mity', 'such', 'as', 'a', 'wrist', 'or', 'ankle', 'to', 'reduce', 'swelling', '.', '[SEP]', 'keep', 'the', 'tattoo', 'site', 'clean', 'and', 'dry', 'and', 'avoid', 'soaking', '[MASK]', 'in', 'the', 'first', 'week', 'or', 'two', 'during', 'the', 'healing', 'stage', '.', '[SEP]']
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
index_of_masked_token = tokenized_text.index('[MASK]')
# build the segment IDs: increment after every [SEP]
counter = 0
segments_ids = []
for token in tokenized_text:
    segments_ids.append(counter)
    if token == '[SEP]':
        counter += 1
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
print("size of tokens in tensor {0}".format(tokens_tensor.shape))
print("size of segment tokens in tensor {0}".format(segments_tensors.shape))

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Predict hidden states features for each layer
with torch.no_grad():
     outputs = model(tokens_tensor, token_type_ids=segments_tensors)
     encoded_layers = outputs[0]
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)

# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()


# Predict all tokens
with torch.no_grad():
    # the error is raised by the line below
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

The error is:

      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/transformers/modeling_bert.py", line 752, in forward
    embedding_output = self.embeddings(
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/transformers/modeling_bert.py", line 180, in forward
    token_type_embeddings = self.token_type_embeddings(token_type_ids)
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 124, in forward
    return F.embedding(
      File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    IndexError: index out of range in self

Does anyone know why I get this IndexError?

You are trying to process multiple sentences separated by [SEP] tokens. That does not work with BERT's segment embeddings, and it is not clear what you are trying to achieve.

The IndexError is raised because BERT was pretrained with only two segment IDs, 0 and 1, which were needed for the NSP objective. It therefore does not make sense to add more segments/segment IDs, and as you found out, that simply won't work. Perhaps you should start by explaining what you are trying to do.
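
To see this concretely, you can inspect the token type embedding table yourself. A quick sanity check, assuming bert-base-uncased:

from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
print(model.config.type_vocab_size)            # 2: only segment IDs 0 and 1 were pretrained
print(model.embeddings.token_type_embeddings)  # Embedding(2, 768)

# Your segments_ids run from 0 up to 4, so the lookup indexes past the end
# of this 2-row embedding matrix, which raises "index out of range in self".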

I highly recommend reading the original BERT paper to better understand this.

Hi @Kwiebes1995,

You don’t need to put a SEP token after every sentence. The SEP token is intended particularly for tasks like Next Sentence Prediction, where you are predicting whether a second sentence logically follows a first sentence.
In some cases, you will want to use a single text that is made up of many sentences. If so, you would put a SEP token at the very end of the text only. In that case, the segment_ids would all be 0, and I think you wouldn’t need to define them at all, since token_type_ids is an optional parameter.
Since you have only used one MASK token in the whole text, I assume you are treating the whole text as one input. Is that correct? If so, then I don’t think you should separate the text into sentences.
You can still make a tensor out of a single input text (if the model is expecting a tensor): it will just have a dimension of 1 in that direction.
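
Here is a minimal sketch of that single-input approach (the example text and variable names are mine; it relies on the tokenizer adding one [CLS] and one trailing [SEP], and omits token_type_ids so they default to all zeros):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

text = "keep the tattoo site clean and dry and avoid soaking [MASK] in the first week."
inputs = tokenizer(text, return_tensors='pt')  # adds [CLS] ... [SEP], no extra segments

with torch.no_grad():
    predictions = model(**inputs)[0]

# Locate the masked position and take the highest-scoring token for it.
masked_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero().item()
predicted_token = tokenizer.convert_ids_to_tokens([predictions[0, masked_index].argmax().item()])[0]
print(predicted_token)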

[I might have misunderstood what you are doing. I’ve only used BertForSequenceClassification.]