Padding causes wrong predictions?

Hello guys,

I am trying out multiclass text classification using DistilBERT. The dataset contains user feedback on a product, with 4 categories:

0 - Good - “easy to use”
1 - Bad - “slow and constant crashing”
2 - Questions - “can you add feature x and y?”
3 - Others - “NIL”

I am following the guide here, and I padded the datasets like this:

train_encodings = tokenizer(train_texts, truncation=True, padding='max_length', max_length=128)
test_encodings = tokenizer(test_texts, truncation=True, padding='max_length', max_length=128)

I’ve set max_length to 128 for both because I thought having different padding lengths would cause issues, but some other posts here seem to suggest otherwise.

From my limited knowledge, to get predictions for new texts from the model, I would have to tokenize those texts and pad them to 128 as well. That is what I did, but the prediction came out wrong. Meanwhile, if I didn’t pad at all, it predicted correctly.

Code:

import torch

new_prediction = tokenizer('good', truncation=True, padding='max_length', max_length=128)
new_text = torch.Tensor(new_prediction['input_ids']).long().reshape(1, len(new_prediction['input_ids'])).to('cuda:0')
print("Padded with all 0s version")
print(new_text)
print(tokenizer.batch_decode(new_text, skip_special_tokens=True))
print(model(new_text)[0].argmax(1))
print(model(new_text))

print("\nNo 0s version")
new_prediction = tokenizer('good', truncation=True, padding=True)
new_text = torch.Tensor(new_prediction['input_ids']).long().reshape(1, len(new_prediction['input_ids'])).to('cuda:0')
print(new_text)
print(tokenizer.batch_decode(new_text, skip_special_tokens=True))
print(model(new_text)[0].argmax(1))
print(model(new_text))

Output:

Padded with all 0s version
tensor([[ 101, 2204,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]], device='cuda:0')
['good']
tensor([1], device='cuda:0')
SequenceClassifierOutput(loss=None, logits=tensor([[ 0.5761,  1.5903, -1.0050, -1.0793]], device='cuda:0',
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

No 0s version
tensor([[ 101, 2204,  102]], device='cuda:0')
['good']
tensor([0], device='cuda:0')
SequenceClassifierOutput(loss=None, logits=tensor([[ 6.5631, -2.2075, -1.9927, -2.2027]], device='cuda:0',
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

As you can see, the word ‘good’ should return category 0; the padded version does not, while the non-padded version predicts correctly.

I am a beginner and am having a hard time understanding why. Can someone enlighten me? Thanks!

Please have a look at the course, in particular this section. You need to pass the attention mask returned by the tokenizer so that your model ignores the padding. Without the mask, the model attends to all the padding tokens, which is why the logits differ between your padded and unpadded versions.
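
For example, something along these lines should work (a rough sketch, assuming model and tokenizer are your fine-tuned DistilBERT and its tokenizer; return_tensors='pt' also saves you the manual tensor conversion):

import torch

# The tokenizer output contains both 'input_ids' and 'attention_mask'.
inputs = tokenizer('good', truncation=True, padding='max_length',
                   max_length=128, return_tensors='pt').to('cuda:0')

with torch.no_grad():
    # **inputs forwards the attention_mask along with the input_ids,
    # so the model ignores the padding tokens when computing attention.
    outputs = model(**inputs)

print(outputs.logits.argmax(1))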


Ah, that explains my problems!

I have fixed my issues by including the attention mask, thank you very much!