Does it make sense to train DistilBERT from scratch on a new corpus?

Hi! First post in the forums, excited to start getting deep into this great library!

I have a rookie, theoretical question. I have been reading the DistilBERT paper (fantastic!) and was wondering if it makes sense to pretrain a DistilBERT model from scratch.

In the paper, the authors specify that “The student is trained with a distillation loss over the soft target probabilities of the teacher.” My question is: when pretraining DistilBERT on a new corpus (say, in another language), what are the ‘probabilities of the teacher’? AFAIK, the teacher does not have any interesting probabilities to show, since it has never seen the corpus either.

So my question is, how does the transformers library distill knowledge into the model when I train DistilBertForMaskedLM from scratch on a brand new corpus? Sorry in advance if there is something really obvious I’m missing, I’m quite new to using transformers.

Just to be extra explicit, I would load my model like this:

from transformers import DistilBertConfig, DistilBertForMaskedLM

config = DistilBertConfig(vocab_size=VOCAB_SIZE)
model = DistilBertForMaskedLM(config)

and train it like this:

trainer = Trainer(

Hi @lesscomfortable welcome to the forum!

In the DistilBERT paper they use bert-base-uncased as the teacher for pretraining (i.e. masked language modelling). In particular, the DistilBERT student is pretrained on the same corpus as BERT (Toronto Books + Wikipedia) which is probably quite important for being able to effectively transfer the knowledge from the teacher to the student.

So the answer to your question

My question is: when pretraining DistilBERT on a new corpus (say, in another language), what are the ‘probabilities of the teacher’? AFAIK, the teacher does not have any interesting probabilities to show, since it has never seen the corpus either.

is that the pretrained BERT teacher generates logits and hidden states that can be used to guide the pretraining of the student (through the KL divergence and “cosine embedding” terms in the loss function).
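As a rough illustration of those two loss terms, here is a minimal pure-Python sketch (no PyTorch) of a temperature-softened KL divergence between teacher and student logits, plus a cosine term between hidden-state vectors. The function names, the temperature value, the T² scaling and the toy numbers are assumptions chosen for illustration; the actual distillation scripts operate on batched tensors and may weight the terms differently.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

# Toy logits over a 4-token vocabulary for one masked position (made-up numbers)
teacher_logits = [4.0, 2.5, 0.5, -1.0]
student_logits = [3.0, 1.0, 1.0, 0.0]
print(soft_target_loss(student_logits, teacher_logits))
print(cosine_loss([1.0, 0.0, 1.0], [0.5, 0.5, 1.0]))
```

Note that the KL term only makes sense when the student and the teacher produce logits over the same vocabulary.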

You can find more technical details here and you should check out the module to see how the loss is implemented and for the pretraining logic.

If you want to use a trainer my suggestion would be to subclass Trainer, add the teacher as an attribute and override the compute_loss function, e.g.

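from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Teacher model that provides the soft targets
        self.teacher = teacher_model

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # adapt the code from here
        raise NotImplementedError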

Then you could initialise the teacher and student along the lines you did

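from transformers import BertForMaskedLM, DistilBertConfig, DistilBertForMaskedLM

teacher_model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# For the logit-based loss, the student's vocabulary must match the teacher's
student_config = DistilBertConfig(vocab_size=VOCAB_SIZE)
student_model = DistilBertForMaskedLM(student_config)

trainer = DistillationTrainer(
        teacher_model=teacher_model,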


  • I do not know how well this will work if your corpus is significantly different from the one BERT was pretrained on. For example, if your corpus is in another language then you’d be better off using a different teacher model in that language.
  • This approach is likely to be error-prone and expensive ($$$), so my suggestion is to use the battle-tested scripts from the link above.

Hope that helps!

Hey @lewtun thanks for the prompt answer!

The link is very clear on how to correctly use the API for distillation. To distill DistilBERT in a new language, I understand I should first pretrain BERT from scratch with my full corpus and then use the script to initialize and distil the knowledge into the smaller model.

I have an additional question. What I did originally was to pretrain DistilBERT directly on my new corpus.

What is going on under the hood here? As I understand it, the difference is that, the way I did it, the only supervision comes from the hard targets (the masked tokens), whereas distillation would also transfer the teacher’s knowledge through the soft targets. The second way is superior since, as Victor Sanh clearly explains in his blog post:

This loss is a richer training signal since a single example enforces much more constraint than a single hard target.

So my training would still be valid but I should expect less accuracy than if I did the pretraining through distillation. Is this correct?

Yes, your pretraining approach with DistilBERT is perfectly valid, since in that application you’re simply using the model architecture without any knowledge distillation from a teacher (i.e. “under the hood” it’s no different to pretraining BERT, GPT-2, etc. from scratch).
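To make that concrete: without a teacher, the only training signal is the hard-target masked language modelling objective. Below is a minimal plain-Python sketch of BERT-style masking, assuming the usual recipe from the BERT paper (15% of positions selected, of which 80% become [MASK], 10% a random token and 10% stay unchanged) and the -100 "ignore" label convention. The names mask_tokens, MASK_ID and VOCAB_SIZE and the toy ids are hypothetical, and special-token handling is left out; in practice transformers' DataCollatorForLanguageModeling takes care of this step.

```python
import random

MASK_ID = 103        # hypothetical id of the [MASK] token
VOCAB_SIZE = 1000    # hypothetical vocabulary size
IGNORE = -100        # label value skipped by the cross-entropy loss

def mask_tokens(input_ids, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for masked language modelling."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token_id in input_ids:
        if rng.random() < mlm_probability:
            labels.append(token_id)  # hard target: the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_ID)                    # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                masked.append(token_id)                   # 10%: keep unchanged
        else:
            labels.append(IGNORE)    # position does not contribute to the loss
            masked.append(token_id)
    return masked, labels

ids = list(range(200, 240))  # toy sequence of token ids
masked, labels = mask_tokens(ids)
print(sum(label != IGNORE for label in labels), "of", len(ids), "positions selected")
```

Each selected position contributes a single correct token as its target, which is the "hard target" that the soft teacher distribution would otherwise enrich.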

So my training would still be valid but I should expect less accuracy than if I did the pretraining through distillation. Is this correct?

In my experience it is generally true that a distilled model performs better than the same architecture trained from scratch. Out of curiosity, what is your use case? You might find that someone has already pretrained a language model for your domain / language, which would be a better starting point than training from scratch :)

I am pre-training on product reviews written in Argentinean Spanish to generate meaningful embeddings that I can later use to understand properties about a review such as: what aspects of the purchase is it referring to, how positive or negative was the buying experience etc.

There is a great deal of data about this and the domain-language is quite specific. Using a pretrained multilingual model has two problems:

  1. The vocabulary is generally very large and most of the tokens are irrelevant for my use case (this makes training and inference slower).
  2. The vocabulary misses some domain-specific language idioms that are frequently used by users and are important to understand the overall sense of the phrase.

This is why I think pretraining would be a better choice for my use case.

Thanks for the information @lesscomfortable - in that case I wonder whether it would be faster / cheaper to just fine-tune an existing Spanish language model like dccuchile/bert-base-spanish-wwm-uncased (on the Hugging Face Hub) on your corpus.

(There’s already a script for doing this in transformers/examples/language-modeling on the master branch of the huggingface/transformers GitHub repository.)

By all means pretrain from scratch if you have enough text data / compute, but the above suggestion would serve as a quick benchmark to compare against :)

PS. I realise that there are quite a few differences between the Spanish of Argentina and that of Spain, and I’m not sure what the Spanish BERT model was trained on …

It is definitely an interesting experiment to try and an interesting benchmark to have.

It would be great if I could use my own vocabulary while also leveraging the knowledge of this pretrained model. This would allow me to keep idioms and remove useless tokens from the vocab. Do you know if there is any way to change the vocabulary before finetuning? Haven’t found it in these scripts.

As far as I know, if you want to change the vocabulary in a significant way you’ll have to train the tokenizer from scratch, which also means redoing the pretraining from scratch.

One thing I’ve seen in the docs before is the add_tokens method (see “Utilities for Tokenizers” in the transformers 4.3.0 documentation).

Depending on how much you need to change the base vocabulary of the pretrained model, this might be a good start (although I’ve never tried this before so the results might be unexpected …)
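For reference, add_tokens is usually paired with resize_token_embeddings on the model, so that the new tokens get embedding rows. The sketch below shows the two calls on a tiny, randomly initialised DistilBERT and a toy vocabulary written to a temporary file, so that it runs without downloading anything. The helper name extend_vocabulary, the toy vocabulary, the tiny config values and the example tokens are all assumptions for illustration; with a real checkpoint the tokenizer and model would be loaded with from_pretrained instead.

```python
import os
import tempfile

from transformers import BertTokenizer, DistilBertConfig, DistilBertForMaskedLM

def extend_vocabulary(tokenizer, model, new_tokens):
    """Add tokens to the tokenizer and grow the model's embedding matrix to match."""
    num_added = tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))
    return num_added

# Toy vocabulary so the example runs without downloading a checkpoint
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "el", "producto", "es", "bueno"]
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_vocab) + "\n")
    tokenizer = BertTokenizer(vocab_path)

# Deliberately tiny config, just for illustration
config = DistilBertConfig(vocab_size=len(tokenizer), dim=32, hidden_dim=64, n_layers=1, n_heads=2)
model = DistilBertForMaskedLM(config)

# Hypothetical domain-specific tokens missing from the base vocabulary
num_added = extend_vocabulary(tokenizer, model, ["copado", "joya"])
print(num_added, model.get_input_embeddings().weight.shape)
```

The embedding rows for the added tokens are not pretrained, so they only become meaningful after fine-tuning on text that contains them. This also does not remove unused tokens from the original vocabulary.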

That utility might be useful, thanks for the help Lewis!

Hi there, I checked the link and it’s unavailable. Do you have another source for this? I’m also still confused about train_dataset and eval_dataset, because all I have is a corpus of .txt files. I’ve been following the step-by-step guide in “How to train a new language model from scratch using Transformers and Tokenizers”, but now I’m stuck at training the model.

Hi there, I’m trying to do what you are doing with my own corpus. Could you please provide some guidance on how to do this? I just finished mining my .txt corpus.

Hey @imtrying, the examples have recently been split up by the frameworks that transformers supports, so you can find the PyTorch language-modeling script in transformers/examples/pytorch/language-modeling on the master branch of the huggingface/transformers GitHub repository.

Which part is causing you confusion? How to create the datasets, or something else?
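On the train_dataset / eval_dataset point: if all you have is raw .txt files, one common approach is to hold out a small share of lines as an evaluation set and keep the rest for training. Here is a minimal sketch, assuming one document or sentence per line; the function name split_corpus, the file names and the 5% hold-out are arbitrary choices for illustration, not something prescribed by the library.

```python
import random
import tempfile
from pathlib import Path

def split_corpus(corpus_path, train_path, eval_path, eval_fraction=0.05, seed=42):
    """Shuffle the non-empty lines of a text corpus and write train / eval files."""
    text = Path(corpus_path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    random.Random(seed).shuffle(lines)
    n_eval = max(1, int(len(lines) * eval_fraction))  # keep at least one eval line
    Path(eval_path).write_text("\n".join(lines[:n_eval]) + "\n", encoding="utf-8")
    Path(train_path).write_text("\n".join(lines[n_eval:]) + "\n", encoding="utf-8")
    return len(lines) - n_eval, n_eval

# Small demo on a synthetic corpus written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.txt"
    corpus.write_text("\n".join(f"review number {i}" for i in range(100)), encoding="utf-8")
    print(split_corpus(corpus, Path(tmp) / "train.txt", Path(tmp) / "eval.txt"))
```

The two resulting files could then be passed to the run_mlm.py script via its --train_file and --validation_file arguments.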