A colleague and I each ran an experiment following the example at transformers/examples/research_projects/zero-shot-distillation at master · huggingface/transformers · GitHub. Even though it was a zero-shot experiment, we used data for which we had labels, so we could evaluate how well the zero-shot predictions performed.

When we ran the distillation part of our experiments, we were both surprised to find that the distilled student model was significantly more accurate than the zero-shot teacher model (experiment 1: distilled model 48.12% vs. zero-shot model 42.91%; experiment 2: distilled model 79.82% vs. zero-shot model 77.36%). In the second experiment there is a small possibility that the improvement could be explained by chance (only 5,000 examples), but not in the first, which has 86,651 examples.

I wonder whether other people have seen similar improvements and, if this is a known phenomenon, what would explain it.
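For the "could it be chance?" question, here is a quick sketch of how I checked it (my own illustration, not part of the linked example). It uses an unpaired two-proportion z-test; a paired test such as McNemar's would be tighter, since both models are scored on the same examples, but this gives a reasonable first estimate:

```python
import math

def two_proportion_z(acc1: float, acc2: float, n: int):
    """Two-proportion z-test for a difference in accuracy,
    assuming both models were evaluated on n examples each."""
    p_pool = (acc1 + acc2) / 2            # pooled proportion (equal n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    z = (acc1 - acc2) / se
    # two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Experiment 1: distilled 48.12% vs zero-shot 42.91% on 86,651 examples
z1, p1 = two_proportion_z(0.4812, 0.4291, 86651)

# Experiment 2: distilled 79.82% vs zero-shot 77.36% on 5,000 examples
z2, p2 = two_proportion_z(0.7982, 0.7736, 5000)

print(f"experiment 1: z = {z1:.2f}, p = {p1:.2e}")
print(f"experiment 2: z = {z2:.2f}, p = {p2:.2e}")
```

On these numbers the gap in experiment 1 comes out as vanishingly unlikely under chance, while experiment 2 is borderline but still small, which matches my reading above.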