Hello guys,
I am trying out multiclass text classification using DistilBERT. The dataset contains user feedback on a product, with 4 categories:
0 - Good - “easy to use”
1 - Bad - “slow and constant crashing”
2 - Questions - “can you add feature x and y?”
3 - Others - “NIL”
I am following the guide here, and this is how I padded the datasets:
train_encodings = tokenizer(train_texts, truncation=True, padding='max_length', max_length=128)
test_encodings = tokenizer(test_texts, truncation=True, padding='max_length', max_length=128)
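For reference, here is a minimal standalone sketch of what these settings produce (assuming the distilbert-base-uncased checkpoint, which is my guess at what the guide uses) — note that the encoding contains an attention_mask alongside the padded input_ids:

```python
from transformers import AutoTokenizer

# Assumption: the distilbert-base-uncased checkpoint from the guide
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

enc = tokenizer(["easy to use"], truncation=True,
                padding='max_length', max_length=128)

# input_ids are padded out to length 128 with the pad token id (0);
# attention_mask is 1 for real tokens and 0 for the padding positions
print(len(enc["input_ids"][0]))        # 128
print(enc["input_ids"][0][0])          # 101 ([CLS])
print(enc["attention_mask"][0][-1])    # 0 (padding position)
```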
I set max_length to 128 for both because I thought differing pad lengths would cause issues, but some other posts here seem to suggest otherwise.
From my limited knowledge, to get predictions from the model for new text, I would have to tokenize it and pad it to 128 as well. That is what I did, but the prediction came out wrong. Meanwhile, if I didn't pad it at all, it would predict correctly.
Code:
new_prediction = tokenizer('good', truncation=True, padding='max_length', max_length=128)
new_text = torch.Tensor(new_prediction['input_ids']).long().reshape(1, len(new_prediction['input_ids'])).to('cuda:0')
print("Padded with all 0s version")
print(new_text)
print(tokenizer.batch_decode(new_text, skip_special_tokens=True))
print(model(new_text)[0].argmax(1))
print(model(new_text))
print("\nNo 0s version")
new_prediction = tokenizer('good', truncation=True, padding=True)
new_text = torch.Tensor(new_prediction['input_ids']).long().reshape(1, len(new_prediction['input_ids'])).to('cuda:0')
print(new_text)
print(tokenizer.batch_decode(new_text, skip_special_tokens=True))
print(model(new_text)[0].argmax(1))
print(model(new_text))
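As a side note, I believe the manual torch.Tensor(...).long().reshape(...) step can be replaced with return_tensors='pt', which (if I understand the tokenizer docs correctly) returns batched PyTorch tensors directly:

```python
import torch
from transformers import AutoTokenizer

# Assumption: the distilbert-base-uncased checkpoint, as above
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# return_tensors='pt' yields tensors with a batch dimension already added,
# so no manual reshape or dtype conversion is needed before calling the model
enc = tokenizer("good", truncation=True, padding='max_length',
                max_length=128, return_tensors='pt')
print(enc["input_ids"].shape)        # torch.Size([1, 128])
print(enc["attention_mask"].shape)   # torch.Size([1, 128])
```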
Output:
Padded with all 0s version
tensor([[ 101, 2204, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')
['good']
tensor([1], device='cuda:0')
SequenceClassifierOutput(loss=None, logits=tensor([[ 0.5761, 1.5903, -1.0050, -1.0793]], device='cuda:0',
grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
No 0s version
tensor([[ 101, 2204, 102]], device='cuda:0')
['good']
tensor([0], device='cuda:0')
SequenceClassifierOutput(loss=None, logits=tensor([[ 6.5631, -2.2075, -1.9927, -2.2027]], device='cuda:0',
grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
As you can see, the word 'good' should return category 0: the padded version wrongly returns category 1, while the non-padded version returns category 0 correctly.
I am a beginner and am having a hard time understanding why. Can someone enlighten me? Thanks!