Poor performance in zero-shot learning when using the model 'typeform/distilbert-base-uncased-mnli'


I have tried to use the model `typeform/distilbert-base-uncased-mnli` for zero-shot classification (multi-class, not multi-label). However, I am getting very poor results, especially compared to the model `facebook/bart-large-mnli`. I have tried both the zero-shot classification pipeline and calling the model directly, and the results are just as bad either way.
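
For context, here is a minimal sketch of what I understand the pipeline to be doing under the hood, with the NLI model replaced by a stub and made-up logits (the labels and numbers are placeholders, not my actual 13 categories):

```python
import math

# Hypothetical stand-in for an MNLI model's entailment logit; in practice this
# would come from a sequence-classification model such as
# typeform/distilbert-base-uncased-mnli. The numbers here are made up.
def entailment_logit(premise, hypothesis):
    scores = {
        "This example is about sports.": 2.1,
        "This example is about politics.": 0.3,
        "This example is about cooking.": -1.0,
    }
    return scores[hypothesis]

def zero_shot_classify(text, labels, template="This example is about {}."):
    # One NLI forward pass per candidate label:
    # premise = the text, hypothesis = the template filled with the label.
    logits = [entailment_logit(text, template.format(l)) for l in labels]
    # Multi-class (multi_label=False): softmax across the labels' entailment logits.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: -pair[1])

ranked = zero_shot_classify("some text", ["sports", "politics", "cooking"])
print(ranked[0])  # highest-probability label first
```

With the real model, the only change is that `entailment_logit` comes from a forward pass over the premise/hypothesis pair.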

My test dataset has 46 entries and 13 categories. The accuracy I get with the DistilBERT MNLI model is around 15%, whereas it goes up to 57% with the BART-large MNLI model.

Has anyone else found such a large difference in performance when using a distilled model for this task? I assumed the two would be comparable, since DistilBERT and BERT perform similarly on many NLP tasks, but the gap here is far bigger than I expected.

Also, does anyone have any benchmark results for these two models, either in NLI tasks or zero-shot classification tasks?

Thank you!



I am the one who fine-tuned this model. The original DistilBERT paper reports 82.2 accuracy on MNLI, while BERT-base reaches 86.7. Later papers give slightly different numbers in the same ballpark; for example, the MobileBERT paper reports 81.5 for DistilBERT and 84.6 for BERT-base.

In my fine-tuning, I got 82 accuracy on both MNLI and MNLI-mm. I used the run_glue.py script (huggingface/transformers/blob/master/examples/pytorch/text-classification/run_glue.py) to fine-tune the model with these hyperparameters:

  • max_seq_length: 128
  • per_device_train_batch_size: 16
  • learning_rate: 2e-5
  • num_train_epochs: 5
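
With those hyperparameters, the invocation looks roughly like this (the model path and output directory below are placeholders, not the exact command I ran):

```shell
python run_glue.py \
  --model_name_or_path distilbert-base-uncased \
  --task_name mnli \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 5 \
  --output_dir ./distilbert-base-uncased-mnli
```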

When running this model on our own (very small) zero-shot classification test data, we didn't see a big drop in accuracy, but we did observe that the model is less "certain" of the correct answer, i.e., it assigns a lower probability to the correct label.

You can also try our fine-tuned MobileBERT; it performed marginally better in our testing.


Hi @hhschu,

Thank you for the information regarding this model; it is incredibly useful. I also noticed that the scores this model gives are lower than those of other models, which is odd.

I will try the fine-tuned MobileBERT then and see whether the results improve. If you don't mind my asking, what accuracy did you obtain when fine-tuning MobileBERT on MNLI and MNLI-mm?

@valkyrie We got 84 accuracy with MobileBERT. Qualitatively, it is still much worse than RoBERTa (91 accuracy on MNLI in our experiment) on our zero-shot test data in terms of confidence in the correct label, but it is better than DistilBERT.


@hhschu thank you again. Just to confirm: in your zero-shot classification experiments, did you obtain one class per entry (i.e. setting multi_label=False) rather than using a multi-label setup? That is how I am using it at the moment.

@valkyrie our zero-shot data is single-label, indeed. In that case, turning multi-label on or off shouldn't make much difference in accuracy, in my experience. With multi-label off, the pipeline essentially applies a softmax across the per-label entailment scores, so the top label comes out the same.
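
A tiny numeric illustration of that last point, with made-up logits (the independent per-label scoring is sketched here as a sigmoid; the real pipeline softmaxes entailment vs. contradiction per label, but either way both maps are monotone in the entailment logit, so the argmax is unchanged):

```python
import math

# Made-up per-label entailment logits from an NLI model.
logits = {"sports": 2.1, "politics": 0.3, "cooking": -1.0}

# multi_label=True: each label is scored independently
# (sketched as a sigmoid of its entailment logit; illustrative only).
independent = {l: 1 / (1 + math.exp(-z)) for l, z in logits.items()}

# multi_label=False: one softmax across all labels' entailment logits.
total = sum(math.exp(z) for z in logits.values())
softmaxed = {l: math.exp(z) / total for l, z in logits.items()}

# Both transforms are monotone in the logits, so the top label agrees.
top_independent = max(independent, key=independent.get)
top_softmax = max(softmaxed, key=softmaxed.get)
print(top_independent, top_softmax)  # same label either way
```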

@hhschu thank you for all your help with this, I really appreciate your input and the information you’ve given me.