How to prevent Transformers from deleting task-head?

When I load a pretrained Transformer for training, I usually get a warning similar to the one below:

# resulting warning:
Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForMaskedLM: ['lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.LayerNorm.bias', 'deberta.embeddings.position_embeddings.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight']
- This IS expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForMaskedLM were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

I know that this is normally not a problem because for fine-tuning on a new task with a new classification head, we need to delete the old classification head. But what if I actually want to keep the task head from pre-training? How do I prevent Transformers from deleting the task head?

I know that for established tasks like MLM I can use AutoModelForMaskedLM etc. But what if the task head is not MLM? In my case it is an ELECTRA-style replaced-token-detection head (see here: DeBERTa-v3: How to keep ELECTRA-style task-head?).

Is there no solution for this issue?

Is there no way to prevent Transformers from deleting the task head when loading a model like DeBERTa-v3? I would really love to use the replaced-token-detection task head from DeBERTa-v3 (same as ELECTRA), but it always gets deleted when I load it via HuggingFace Transformers.


If you’re instantiating a DebertaV2ForMaskedLM architecture with the weights of microsoft/deberta-v3-base, it will simply discard the weights that aren’t compatible with the model architecture. So in this case, DebertaV2ForMaskedLM only includes the language modeling head on top, as can be seen here. And the head is defined as an attribute, namely cls.predictions.

However, the language modeling head parameters at microsoft/deberta-v3-base are defined as ‘lm_predictions.lm_head.dense.bias’, ‘lm_predictions.lm_head.dense.weight’, ‘lm_predictions.lm_head.LayerNorm.bias’, ‘lm_predictions.lm_head.bias’, ‘lm_predictions.lm_head.LayerNorm.weight’. In other words, the authors of DeBERTa-v3 defined the head as an lm_predictions.lm_head attribute on top of the base model, whereas DebertaV2ForMaskedLM expects it under cls.predictions, which causes the mismatch.

Moreover, DebertaV2ForMaskedLM doesn’t include the mask predictions head.
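The mismatch can be reproduced without loading any weights, simply by comparing parameter names. A minimal sketch in plain Python, using only a handful of the key names from the warning above; diff_keys is a hypothetical helper for illustration, not part of Transformers:

```python
# Parameter names copied from the warning above (subset only).
checkpoint_keys = {
    "lm_predictions.lm_head.dense.weight",
    "lm_predictions.lm_head.dense.bias",
    "mask_predictions.dense.weight",
    "mask_predictions.classifier.weight",
}
model_keys = {
    "cls.predictions.transform.dense.weight",
    "cls.predictions.transform.dense.bias",
    "cls.predictions.decoder.weight",
}


def diff_keys(checkpoint, model):
    """Hypothetical helper: group names the way the warning does."""
    unused = sorted(checkpoint - model)             # in checkpoint, no slot in the model
    newly_initialized = sorted(model - checkpoint)  # in model, nothing to load
    return unused, newly_initialized


unused, newly_initialized = diff_keys(checkpoint_keys, model_keys)
print("not used:", unused)
print("newly initialized:", newly_initialized)
```

Because no name appears in both sets, every head weight in the checkpoint ends up in the "not used" list and every head weight in the model is freshly initialized.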

So, to properly instantiate all the weights from microsoft/deberta-v3-base, you need to define the exact same architecture to make sure all weights can be loaded. Showcasing a quick draft here, based on the original definition:

from transformers import DebertaV2Model, DebertaV2PreTrainedModel

class DebertaV3ForPreTraining(DebertaV2PreTrainedModel):
  def __init__(self, config):
    # required so the parent class can register the config and submodules
    super().__init__(config)

    # base Transformer
    self.deberta = DebertaV2Model(config)

    # language modeling head
    self.lm_predictions = DebertaV3LMHead(config)

    # mask predictions head
    self.mask_predictions = DebertaV3MaskPredictionHead(config)

  def forward(self, input_ids, attention_mask=None, labels=None, position_ids=None):
    outputs = self.deberta(input_ids, attention_mask=attention_mask, position_ids=position_ids)

    sequence_output = outputs[0]
    # apply heads
    lm_logits = self.lm_predictions(sequence_output)
    mask_logits = self.mask_predictions(sequence_output, input_ids, attention_mask)

    return lm_logits, mask_logits

Here, DebertaV3LMHead should be defined as this class and DebertaV3MaskPredictionHead should be defined as this class.

This way, you’ll be able to do:

model = DebertaV3ForPreTraining.from_pretrained("microsoft/deberta-v3-base")

and all weights will be loaded. Also, feel free to contribute this head model to the library :smiley: this way people could use it directly from the library.
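Whether the heads were actually loaded can be checked rather than assumed. A sketch, relying on from_pretrained(..., output_loading_info=True), which returns the model together with a dict containing missing_keys and unexpected_keys. The helper heads_fully_loaded and the prefix tuple are illustrative names, and the commented-out call assumes the DebertaV3ForPreTraining draft above has been completed with working head classes:

```python
HEAD_PREFIXES = ("lm_predictions.", "mask_predictions.")


def heads_fully_loaded(loading_info, prefixes=HEAD_PREFIXES):
    """True if no head parameter was dropped or freshly initialized."""
    problem_keys = list(loading_info.get("missing_keys", [])) + list(
        loading_info.get("unexpected_keys", [])
    )
    return not any(key.startswith(prefixes) for key in problem_keys)


# Usage (needs network access and the completed custom class):
# model, info = DebertaV3ForPreTraining.from_pretrained(
#     "microsoft/deberta-v3-base", output_loading_info=True
# )
# print(heads_fully_loaded(info))

# Offline illustration with a hand-written loading_info dict:
example_info = {
    "missing_keys": [],
    "unexpected_keys": ["mask_predictions.dense.weight"],
}
print(heads_fully_loaded(example_info))  # prints False
```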

Hi @nielsr, thanks a lot for your response!

I tried to follow your instructions, see the colab here: Google Colab

Now it’s throwing the error: TypeError: forward() missing 7 required positional arguments: 'ebd_weight', 'target_ids', 'input_ids', 'input_mask', 'z_states', 'attention_mask', and 'encoder'
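The error suggests that the head's forward method expects more inputs than the single sequence_output tensor passed in the draft. One way to see what a module needs before wiring it up is to inspect its signature. A sketch with a stand-in class: OriginalStyleHead is hypothetical, its first parameter name hidden_states is an assumption, and the remaining parameter names are taken from the error message rather than from the actual DeBERTa source:

```python
import inspect


class OriginalStyleHead:
    """Stand-in for a head whose forward() takes many inputs (hypothetical)."""

    def forward(self, hidden_states, ebd_weight, target_ids, input_ids,
                input_mask, z_states, attention_mask, encoder):
        raise NotImplementedError


def required_args(func):
    """Names of positional parameters without defaults, excluding self."""
    params = inspect.signature(func).parameters.values()
    return [
        p.name for p in params
        if p.default is inspect.Parameter.empty
        and p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
        and p.name != "self"
    ]


needed = required_args(OriginalStyleHead.forward)
print(len(needed), needed)
```

Calling such a head with only one positional argument would leave seven parameters unfilled, which is consistent with the TypeError above. The wrapper's forward would have to supply each of them, or the head would have to be rewritten with a simpler signature.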

Unfortunately, I don’t know PyTorch / Transformers well enough to see all the adaptations that have to be made to make it compatible with Transformers :confused:

(More and more people are using the replaced-token-detection / ELECTRA objective for few-shot learning, but they all seem to rely on the old ELECTRA rather than the improved DeBERTa-v3, presumably because there is currently no easy way to use DeBERTa-v3’s RTD head. See the research from Facebook, GitHub - facebookresearch/ELECTRA-Fewshot-Learning: This repository contains the code for paper Prompting ELECTRA Few-Shot Learning with Discriminative Pre-Trained Models., or ACL: )