Hi!
I was trying to use my own data with the BERT language model example mentioned here:
However, when I use my own data I get an IndexError: index out of range in self. At first I thought it was related to the sequence length, but I also get the error for sequences shorter than 512 tokens.
The code is:
# imports and tokenizer setup (left out of my original snippet);
# I assume the bert-base-uncased tokenizer since that is the model loaded below
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenized_text = ['[CLS]', '#', '#', 'steps', '[SEP]', '1', '.', 'if', 'the', 'area', 'is', 'hot', 'or', 'in', '##fl', '##ame', '##d', 'after', 'your', 'laser', 'tattoo', 'removal', 'session', 'you', 'can', 'apply', 'an', 'ice', 'pack', 'wrapped', 'in', 'a', 'damp', 'cloth', '.', '[SEP]', '2', '.', 'over', 'the', 'counter', 'pain', 'relief', 'such', 'as', 'para', '##ce', '##tam', '##ol', 'can', 'help', 'by', 'reducing', 'any', 'temporary', 'pain', '.', '[SEP]', '3', '.', 'el', '##eva', '##te', 'the', 'area', 'is', 'its', 'an', 'ex', '##tre', '##mity', 'such', 'as', 'a', 'wrist', 'or', 'ankle', 'to', 'reduce', 'swelling', '.', '[SEP]', 'keep', 'the', 'tattoo', 'site', 'clean', 'and', 'dry', 'and', 'avoid', 'soaking', '[MASK]', 'in', 'the', 'first', 'week', 'or', 'two', 'during', 'the', 'healing', 'stage', '.', '[SEP]']
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
index_of_masked_token = tokenized_text.index('[MASK]')
# make the segments_ids
counter = 0
segments_ids = []
for token in tokenized_text:
    segments_ids.append(counter)
    if token == '[SEP]':
        counter += 1
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
print("size of tokens in tensor {0}".format(tokens_tensor.shape))
print("size of segment tokens in tensor {0}".format(segments_tensors.shape))
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
# Predict hidden states features for each layer
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    encoded_layers = outputs[0]
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# Predict all tokens
with torch.no_grad():
    # error is caused by the line below.
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]
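Once the forward pass works, this is roughly how I intend to read off the prediction for the masked position (the decoding below is my own sketch, not taken from the example):
predicted_index = torch.argmax(predictions[0, index_of_masked_token]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print("predicted token for [MASK]: {0}".format(predicted_token))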
The error is:
File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/transformers/modeling_bert.py", line 752, in forward
embedding_output = self.embeddings(
File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/transformers/modeling_bert.py", line 180, in forward
token_type_embeddings = self.token_type_embeddings(token_type_ids)
File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 124, in forward
return F.embedding(
File "/Users/talita/Documents/PhD/corpora/rulebook_diffs/2019-09-23/boardgame_scripts/venv/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
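In case it helps, I also printed the inputs just before the failing call (my own debugging lines, not part of the example); as far as I can tell the shapes and values look like this:
print(tokens_tensor.shape)        # torch.Size([1, 105]), so well under the 512 limit
print(segments_tensors.shape)     # torch.Size([1, 105])
print(sorted(set(segments_ids)))  # [0, 1, 2, 3, 4]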
Does anyone know why I get this IndexError?