Certain words don't work with bert?

close · June 3, 2021, 11:30pm

Hi, I was trying to run BERT but kept getting the error “IndexError: index out of range in self”. After troubleshooting for a couple of days, I figured out that the word “screwing” was breaking my code. Is this a bug, or are there certain words you can't use with BERT? Or am I just doing something wrong? Thanks.

Here's the code example:

import transformers
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super(SentimentClassifier, self).__init__()
        self.bert = transformers.BertModel.from_pretrained('bert-base-cased')
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, mask):
        output = self.bert(
            input_ids=input_ids,
            attention_mask=mask
        )
        output = self.drop(output['pooler_output'])
        return self.out(output)

# this doesn't work
batch_sentences = [
    'screwing',
]

# running this works
batch_sentences2 = [
    'this is a test sentence',
    'another one'
]

bert_model = transformers.BertModel.from_pretrained('bert-base-uncased')
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')

encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True, add_special_tokens=True)
samples = torch.tensor(encoded_inputs['input_ids'])
targets = torch.zeros(samples.shape[0]).long()
mask = (samples != 0)

print(samples.shape)
model = SentimentClassifier(3)
EPOCHS = 10
optimizer = transformers.AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = 1 * EPOCHS
scheduler = transformers.get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

criterion = nn.CrossEntropyLoss()
model = model.train()

for i in range(EPOCHS):
    print(i)
    preds = model(input_ids=samples, mask=mask)
    loss = criterion(preds, targets)
    print(loss)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

ehalit · June 15, 2021, 5:36am

I hope you've figured out the solution already, but it seems that the tokenizer you use (bert-base-uncased) and the model initialized in the SentimentClassifier class (bert-base-cased) do not match. The vocabularies of the cased and uncased tokenizers overlap, so things may appear to work in some cases, but the same token ids can decode to completely different text with different tokenizers. In general, using an uncased tokenizer with a cased model, or vice versa, is always a mistake, even if no error message shows up.
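One way to see why a mismatch like this can raise “index out of range in self” rather than fail silently: the error comes from an embedding lookup with a token id that is at least as large as the embedding table. To my knowledge bert-base-cased has a 28,996-entry vocabulary and bert-base-uncased has 30,522, so an uncased tokenizer can emit ids the cased model cannot index. The sketch below is a plain-Python check for that condition. The helper name `find_out_of_range_ids` and the example ids are hypothetical and are not the real encoding of any word.

```python
def find_out_of_range_ids(batch_input_ids, vocab_size):
    """Return (row, position, token_id) for every id an embedding
    table with `vocab_size` rows could not look up."""
    bad = []
    for row, ids in enumerate(batch_input_ids):
        for pos, token_id in enumerate(ids):
            if token_id < 0 or token_id >= vocab_size:
                bad.append((row, pos, token_id))
    return bad


# Vocabulary sizes as I understand them from the published model configs.
CASED_VOCAB_SIZE = 28996    # bert-base-cased
UNCASED_VOCAB_SIZE = 30522  # bert-base-uncased

# Hypothetical token ids for illustration only.
example_batch = [
    [101, 29000, 102],  # 29000 fits the uncased table but not the cased one
    [101, 2023, 102],
]

problems = find_out_of_range_ids(example_batch, CASED_VOCAB_SIZE)
print(problems)
```

With the objects from the question, the same check would be `find_out_of_range_ids(encoded_inputs['input_ids'], model.bert.config.vocab_size)`, run before the forward pass.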

close · June 15, 2021, 12:38pm

Yeah, I tried switching so that both were uncased, but it didn't fix it. I ended up just going through the dataset with a batch size of one, and if the index error was raised, I deleted the sentence from the dataset. Probably not the best way to deal with it, but it worked.
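A minimal sketch of the workaround described above, assuming a callable that runs one sentence through the tokenizer and model. The names `drop_failing_sentences` and `fake_forward` are hypothetical; `fake_forward` is only a stand-in so the example runs without downloading a model.

```python
def drop_failing_sentences(sentences, run_one):
    """Run each sentence on its own and split the dataset into
    sentences that ran and sentences that raised IndexError."""
    kept, dropped = [], []
    for sentence in sentences:
        try:
            run_one(sentence)
        except IndexError:
            dropped.append(sentence)
        else:
            kept.append(sentence)
    return kept, dropped


def fake_forward(sentence):
    # Stand-in for tokenizing one sentence and calling the model on it.
    # It simulates the failure reported in this thread.
    if "screwing" in sentence:
        raise IndexError("index out of range in self")


kept, dropped = drop_failing_sentences(
    ["this is a test sentence", "another one", "screwing"],
    fake_forward,
)
print(kept)
print(dropped)
```

Returning the dropped sentences alongside the kept ones makes it possible to inspect which inputs fail, which may help track down the underlying cause instead of only hiding it.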
