A colleague and I each ran an experiment following the example at transformers/examples/research_projects/zero-shot-distillation at master · huggingface/transformers · GitHub. Even though it was a zero-shot experiment, we used data for which we had labels, so we could evaluate how well the zero-shot predictions performed.

When we ran the distillation part of our experiments, we were both surprised to find that the distilled student model was significantly more accurate than the zero-shot teacher model (experiment 1: distilled model 48.12% vs. zero-shot model 42.91%; experiment 2: distilled model 79.82% vs. zero-shot model 77.36%). In the second experiment there is a small possibility that the improvement could be explained by chance (only 5,000 examples), but not in the first, which has 86,651 examples.

I wonder whether other people have seen similar improvements and, if this is a known phenomenon, what would explain it.
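For the "could it be chance?" question, here is a quick sketch of how I checked it (my own illustration, not part of the linked example). It uses an unpaired two-proportion z-test; a paired test such as McNemar's would be tighter, since both models are scored on the same examples, but this gives a reasonable first estimate:

```python
import math

def two_proportion_z(acc1: float, acc2: float, n: int):
    """Two-proportion z-test for a difference in accuracy,
    assuming both models were evaluated on n examples each."""
    p_pool = (acc1 + acc2) / 2            # pooled proportion (equal n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    z = (acc1 - acc2) / se
    # two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Experiment 1: distilled 48.12% vs zero-shot 42.91% on 86,651 examples
z1, p1 = two_proportion_z(0.4812, 0.4291, 86651)

# Experiment 2: distilled 79.82% vs zero-shot 77.36% on 5,000 examples
z2, p2 = two_proportion_z(0.7982, 0.7736, 5000)

print(f"experiment 1: z = {z1:.2f}, p = {p1:.2e}")
print(f"experiment 2: z = {z2:.2f}, p = {p2:.2e}")
```

On these numbers the gap in experiment 1 comes out as vanishingly unlikely under chance, while experiment 2 is borderline but still small, which matches my reading above.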