Padding causes wrong predictions?

Hello guys,

I am trying out multiclass text classification using DistilBERT. The dataset contains user feedback on a product, with 4 categories:

0 - Good - “easy to use”
1 - Bad - “slow and constant crashing”
2 - Questions - “can you add feature x and y?”
3 - Others - “NIL”

I am following the guide here, and I padded the datasets like this:

train_encodings = tokenizer(train_texts, truncation=True, padding='max_length', max_length=128)
test_encodings = tokenizer(test_texts, truncation=True, padding='max_length', max_length=128)

I’ve set max_length to 128 for both because I thought having different padding lengths would cause issues, but some other posts here seem to suggest otherwise.

From my limited knowledge, to get predictions for new texts from the model, I would have to tokenize those texts and pad them to 128 as well. That is what I did, but the prediction came out wrong. Meanwhile, if I didn’t pad at all, it predicted correctly.

Code:

import torch

new_prediction = tokenizer('good', truncation=True, padding='max_length', max_length=128)
new_text = torch.Tensor(new_prediction['input_ids']).long().reshape(1, len(new_prediction['input_ids'])).to('cuda:0')
print("Padded with all 0s version")
print(new_text)
print(tokenizer.batch_decode(new_text, skip_special_tokens=True))
print(model(new_text)[0].argmax(1))
print(model(new_text))

print("\nNo 0s version")
new_prediction = tokenizer('good', truncation=True, padding=True)
new_text = torch.Tensor(new_prediction['input_ids']).long().reshape(1, len(new_prediction['input_ids'])).to('cuda:0')
print(new_text)
print(tokenizer.batch_decode(new_text, skip_special_tokens=True))
print(model(new_text)[0].argmax(1))
print(model(new_text))

Output:

Padded with all 0s version
tensor([[ 101, 2204,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]], device='cuda:0')
['good']
tensor([1], device='cuda:0')
SequenceClassifierOutput(loss=None, logits=tensor([[ 0.5761,  1.5903, -1.0050, -1.0793]], device='cuda:0',
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

No 0s version
tensor([[ 101, 2204,  102]], device='cuda:0')
['good']
tensor([0], device='cuda:0')
SequenceClassifierOutput(loss=None, logits=tensor([[ 6.5631, -2.2075, -1.9927, -2.2027]], device='cuda:0',
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

As you can see, the word ‘good’ should return category 0; the padded version does not, while the non-padded version predicts correctly.

I am a beginner and am having a hard time understanding why. Can someone enlighten me? Thanks!

Please have a look at the course, in particular this section. You need to pass the attention mask returned by the tokenizer so that your model ignores the padding. Without the mask, the model attends to all the padding tokens, which is why the logits differ between your padded and unpadded versions.
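
For example, something along these lines should work (a rough sketch, assuming model and tokenizer are your fine-tuned DistilBERT and its tokenizer; return_tensors='pt' also saves you the manual tensor conversion):

import torch

# The tokenizer output contains both 'input_ids' and 'attention_mask'.
inputs = tokenizer('good', truncation=True, padding='max_length',
                   max_length=128, return_tensors='pt').to('cuda:0')

with torch.no_grad():
    # **inputs forwards the attention_mask along with the input_ids,
    # so the model ignores the padding tokens when computing attention.
    outputs = model(**inputs)

print(outputs.logits.argmax(1))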


Ah, that explains my problems!

I have fixed my issues by including the attention mask, thank you very much!