While DeBERTa-v2 was pre-trained with masked language modelling (MLM), DeBERTa-v3 is an improved version pre-trained with the ELECTRA pre-training task (replaced token detection, RTD), which appears to be much more sample-efficient than MLM (see https://arxiv.org/pdf/2111.09543.pdf).
I want to load DeBERTa-v3 with its original RTD task head. When I load it with the code below, however, Transformers drops the RTD head and initialises a fresh MLM head instead (see the warning).
from transformers import AutoModelForPreTraining, AutoTokenizer
model_name = "microsoft/deberta-v3-base"
discriminator = AutoModelForPreTraining.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# resulting warning:
Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForMaskedLM: ['lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.LayerNorm.bias', 'deberta.embeddings.position_embeddings.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight']
- This IS expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForMaskedLM were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
=> Question: is there some way to prevent Transformers from discarding the task head, i.e. to load DeBERTa-v3 together with its pre-trained RTD head?
I imagine this is because the v3 checkpoint reuses the DeBERTa-v2 architecture in Transformers, so AutoModelForPreTraining falls back to DebertaV2ForMaskedLM (an MLM head), and there is no DeBERTa class with an RTD head that the mask_predictions.* weights could be loaded into.
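Judging by the warning, the RTD head weights are actually present in the checkpoint (the mask_predictions.* entries); Transformers just has nowhere to put them. As a workaround I could probably rebuild the head by hand and copy those tensors in, roughly like the sketch below. Note that the layer ordering and the GELU activation are only my guess from the weight names (not taken from the official DeBERTa code), and I'm assuming the repo still ships a pytorch_model.bin:

import torch
from torch import nn
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/deberta-v3-base"

# Grab the raw checkpoint and keep only the tensors Transformers discarded
state_dict = torch.load(hf_hub_download(model_name, "pytorch_model.bin"), map_location="cpu")
rtd_weights = {k.removeprefix("mask_predictions."): v
               for k, v in state_dict.items() if k.startswith("mask_predictions.")}

class RTDHead(nn.Module):
    # ELECTRA-style replaced-token-detection head; layer names mirror the
    # mask_predictions.* keys, but the ordering/activation are assumptions
    def __init__(self, in_dim, mid_dim, num_labels):
        super().__init__()
        self.dense = nn.Linear(in_dim, mid_dim)
        self.LayerNorm = nn.LayerNorm(mid_dim)
        self.classifier = nn.Linear(mid_dim, num_labels)
    def forward(self, hidden_states):
        x = nn.functional.gelu(self.dense(hidden_states))
        return self.classifier(self.LayerNorm(x))  # one score per token

# Layer sizes taken from the checkpoint tensors so load_state_dict matches
head = RTDHead(in_dim=rtd_weights["dense.weight"].shape[1],
               mid_dim=rtd_weights["dense.weight"].shape[0],
               num_labels=rtd_weights["classifier.weight"].shape[0])
head.load_state_dict(rtd_weights)

backbone = AutoModel.from_pretrained(model_name)   # encoder only, no head
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    rtd_logits = head(backbone(**inputs).last_hidden_state)

But this feels brittle (if my reconstructed head doesn't match the original RTD head exactly, the scores would be meaningless), so I'd much rather have Transformers load the head for me.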
I don’t have the same issue when I load ELECTRA with the code below; with ELECTRA, the RTD head is loaded and usable without a problem.
from transformers import ElectraForPreTraining, ElectraTokenizerFast
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
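For reference, this is roughly how the ELECTRA discriminator is queried for RTD (following the usage pattern from the ELECTRA docs; the sentence with the replaced word is just a made-up example), and it is exactly what I'd like to do with DeBERTa-v3:

import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")

# "fake" replaces one word of the original sentence
fake_sentence = "The quick brown fox fake over the lazy dog"
inputs = tokenizer(fake_sentence, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits   # one logit per token

# positive logits = token predicted as "replaced"
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
flags = (logits[0] > 0).long().tolist()
print(list(zip(tokens, flags)))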