Regarding Training a Task Specific Knowledge Distillation model

I was referring to this code:

From @philschmid

I could follow most of the code, but I had a few doubts. Please help me clarify them.

In this code below:

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        
        self.teacher = teacher_model
        
        # place teacher on same device as student
        self._move_model_to_device(self.teacher,self.model.device)
        
        self.teacher.eval()

If I use a fine-tuned teacher model, it is never fine-tuned further during task-specific distillation training, because of the self.teacher.eval() line in the code, right? Only the output of the teacher model is used for the loss calculation?

I couldn't follow the line self._move_model_to_device(self.teacher,self.model.device). What is it actually doing?

In task-specific distillation training I am fine-tuning my student model, but I pass both models to the DistillationTrainer. Where does it make sure that only the student model's weights are updated and not the teacher's?

trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

As far as I can tell, the student model is the one being passed to the Trainer's model property, and when trainer.train() is called the Trainer only looks at its own model to adjust the weights. As you pointed out, the teacher model is set to eval mode, and it's only used in the overridden compute_loss function. (More info about compute_loss here: Trainer) The DistillationTrainer class is just a custom subclass, and the teacher_model never actually gets passed into the Trainer.
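
To make that concrete, here's a rough sketch of how the whole thing fits together, including the overridden compute_loss. This is my reconstruction from the snippets in this thread, not a copy of the notebook, and I'm assuming an alpha weighting hyperparameter lives on the training arguments next to temperature:

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)      # student model + training args go to the Trainer
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher, self.model.device)
        self.teacher.eval()                    # teacher stays in eval mode the whole time

    def compute_loss(self, model, inputs, return_outputs=False):
        # Student forward pass: gradients ARE tracked, so only these weights get updated
        outputs_student = model(**inputs)
        student_loss = outputs_student.loss    # the usual task loss (e.g. cross-entropy)

        # Teacher forward pass: wrapped in no_grad, so no gradients ever reach the teacher
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # Soften both distributions with the temperature and compare them with KL divergence
        loss_function = nn.KLDivLoss(reduction="batchmean")
        loss_logits = loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1),
        ) * (self.args.temperature ** 2)

        # Blend the task loss and the distillation loss (alpha is an assumed hyperparameter here)
        loss = self.args.alpha * student_loss + (1.0 - self.args.alpha) * loss_logits
        return (loss, outputs_student) if return_outputs else loss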

I think self._move_model_to_device(self.teacher,self.model.device) just sets the teacher model to use GPU if it’s available (or at least puts it on whatever device the student model is on), since the Trainer class already does that automatically for the student model when it’s passed in: transformers/trainer.py at c4ad38e5ac69e6d96116f39df789a2369dd33c21 · huggingface/transformers · GitHub
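
For what it's worth, my understanding is that the helper boils down to roughly this (a simplification; the real method in the Trainer handles a couple of extra cases):

# Roughly equivalent to self._move_model_to_device(self.teacher, self.model.device):
# move the teacher onto whatever device (CPU or GPU) the Trainer already chose for the student.
self.teacher = self.teacher.to(self.model.device)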

Hope this helps!


Yes, it helps. A few more clarifications on your answer!

But in the class DistillationTrainer(Trainer), both models are passed, as you can see here:

trainer = DistillationTrainer(
    student_model,                   # <-- student passed here
    training_args,
    teacher_model=teacher_model,     # <-- teacher passed here
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Why do you say that <<the student model is the one being passed to the Trainer's model property>>?

And the teacher model is only there to calculate the loss?

One more basic doubt:

In this code:

loss_function = nn.KLDivLoss(reduction="batchmean")
loss_logits = loss_function(
    F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
    F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1),
) * (self.args.temperature ** 2)

Why are we using log_softmax for the student and softmax for the teacher?
Also, what is the purpose of KLDivLoss? What is it doing?

The DistillationTrainer is a new custom class that's being created in your notebook, which subclasses the Trainer class (from Hugging Face's transformers). So even though you pass both the student_model and the teacher_model to the DistillationTrainer, note that this section of the code:

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        # etc...

means that the teacher model is being saved in the DistillationTrainer as self.teacher, while the student model is being passed up through to the Trainer’s init function as the first parameter when you run:

trainer = DistillationTrainer(
    student_model, # This is a positional argument, captured in *args and passed to super().__init__
    training_args,
    teacher_model=teacher_model,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

If we look at where self.teacher gets used, it’s actually only called in the compute_loss() function. (The Trainer class doesn’t use a self.teacher property)

Are you familiar with subclasses and super in Python? If not, reading into that might help a bit!
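
Here's a tiny, self-contained example of the same pattern in plain Python (nothing to do with transformers, the names are just for illustration): the first positional argument is captured in *args and forwarded to the parent class, while the keyword-only argument stays on the subclass.

class Parent:
    def __init__(self, model):
        self.model = model                 # the parent stores the first positional argument

class Child(Parent):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)  # "model" travels up to Parent unchanged
        self.teacher = teacher_model       # "teacher_model" stays on the subclass

c = Child("student", teacher_model="teacher")
print(c.model)    # student  (this is what the Trainer sees as its model)
print(c.teacher)  # teacher  (only the subclass knows about this one)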

As for the other questions, they’re out of my expertise but this GitHub issue seems to explain it pretty well! (kd loss · Issue #2 · haitongli/knowledge-distillation-pytorch · GitHub) Here’s a Reddit thread specifically about KLDivLoss, which seems to be commonly used in knowledge distillation: Reddit - Dive into anything
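
One mechanical detail I can add from the PyTorch docs: nn.KLDivLoss expects its input to be log-probabilities and its target to be plain probabilities, which is why the student's logits go through log_softmax while the teacher's go through softmax. A minimal, runnable illustration (the numbers are made up):

import torch
import torch.nn as nn
import torch.nn.functional as F

kl = nn.KLDivLoss(reduction="batchmean")

student_logits = torch.tensor([[2.0, 0.5, -1.0]])  # made-up logits: one example, 3 classes
teacher_logits = torch.tensor([[1.8, 0.7, -0.9]])
temperature = 2.0  # softens both distributions before comparing them

loss = kl(
    F.log_softmax(student_logits / temperature, dim=-1),  # input: log-probabilities (student)
    F.softmax(teacher_logits / temperature, dim=-1),      # target: probabilities (teacher)
) * (temperature ** 2)

print(loss)  # close to 0 here because the two distributions are already similar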

This is pretty neat, I've learned a lot researching your question :hugs:


I learned a lot through your answers!! Thanks.


Hi @NimaBoscarino, why do they mention in the paper that <<In our experiments, we have observed that distilled models do not work well when distilled to a different model type. Therefore, we restricted our setup to avoid distilling RoBERTa model to BERT or vice versa. The major difference between the two model groups is the input token (sub-word) embedding. We think that different input embedding spaces result in different output embedding spaces, and knowledge transfer with different spaces does not work well>>?

In the end, we are only taking a KL divergence loss between the two logit vectors, right? So can't we use a RoBERTa XLM large model as the teacher and BERT base as the student, even though they use different tokenizations? Or can we still push our code to train whatever we want, but technically it's not wise to do so?

Oooh, I don’t know enough about that to comment on it :grimacing: I think that what they mean in the paper is that you could use those as teacher + student, but you might not be super impressed with the results. That’s my intuition though, so I might be wrong!


Can anyone help me perform task-specific knowledge distillation on an NER task?