DeBERTa-v3: How to keep ELECTRA-style task-head?

While DeBERTa-v2 was trained with masked language modelling (MLM), DeBERTa-v3 is an improved version pre-trained with the ELECTRA-style replaced token detection (RTD) task, which seems to be much more efficient than MLM (see the paper: https://arxiv.org/pdf/2111.09543.pdf).
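
For intuition, RTD turns pre-training into per-token binary classification: a small generator corrupts some tokens, and the discriminator predicts for every position whether the token is original or replaced. A minimal sketch of that objective with hypothetical tensors (not the actual DeBERTa-v3 training code):

import torch
import torch.nn as nn

# Hypothetical discriminator outputs: one logit per token position
# (batch_size=2, seq_len=5), as produced by the RTD head on top of the encoder.
rtd_logits = torch.randn(2, 5)

# Labels: 1 where the generator replaced the original token, 0 otherwise.
rtd_labels = torch.tensor([[0., 0., 1., 0., 0.],
                           [0., 1., 0., 0., 1.]])

# The RTD loss is binary cross-entropy over all positions, which is part of why
# it is more efficient than MLM: every token contributes a training signal,
# not only the ~15% that are masked.
loss = nn.BCEWithLogitsLoss()(rtd_logits, rtd_labels)
print(loss)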

I want to load DeBERTa-v3 with its original RTD task head. When I load it with the code below, however, Transformers always seems to delete the model head.

from transformers import AutoModelForPreTraining, AutoTokenizer
model_name = "microsoft/deberta-v3-base"
discriminator = AutoModelForPreTraining.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# resulting warning:
Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForMaskedLM: ['lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.LayerNorm.bias', 'deberta.embeddings.position_embeddings.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight']
- This IS expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForMaskedLM were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

=> Question: is there some way to prevent Transformers from deleting the task-head?
I imagine this is because the DeBERTa-v3 checkpoint is loaded through the DeBERTa-v2 architecture, whose AutoModelForPreTraining mapping is DebertaV2ForMaskedLM (an MLM head); since Transformers has no RTD head class for v3, the lm_predictions.* and mask_predictions.* weights in the checkpoint are dropped.
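
One way to see exactly what gets dropped (this doesn't keep the head, it only makes the mismatch visible) is to ask from_pretrained for its loading info; output_loading_info is a standard Transformers kwarg, the rest is just a sketch:

from transformers import AutoModelForPreTraining

# Returns the model plus a dict listing which checkpoint keys were ignored
# and which model weights had to be freshly initialized.
model, loading_info = AutoModelForPreTraining.from_pretrained(
    "microsoft/deberta-v3-base", output_loading_info=True
)

# The RTD head lives in the unused keys (lm_predictions.*, mask_predictions.*);
# no DeBERTa-v2/v3 class in Transformers consumes them, so they are discarded.
print(loading_info["unexpected_keys"])
print(loading_info["missing_keys"])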

I don’t have the same issue when I load ELECTRA with the code below; its RTD head works without a problem.

from transformers import ElectraForPreTraining, ElectraTokenizerFast
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")  
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
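
For reference, this is roughly how the ELECTRA RTD head can be queried (a sketch along the lines of the google/electra-small-discriminator model card; the example sentence is mine, and a positive logit means "replaced"):

import torch

# One token in this sentence has been swapped for a plausible replacement.
fake_sentence = "The quick brown fox fake over the lazy dog"

inputs = tokenizer(fake_sentence, return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits  # shape (1, seq_len): one logit per token

# Positive logit => the discriminator thinks the token was replaced.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, logit in zip(tokens, logits[0]):
    print(f"{token}: {'replaced' if logit > 0 else 'original'}")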

I found the RTD model in the official repo on GitHub: DeBERTa/apps/models/replaced_token_detection_model.py (microsoft/DeBERTa).

import torch
from DeBERTa.apps.models.replaced_token_detection_model import ReplacedTokenDetectionModel
from transformers import AutoTokenizer

# Build the RTD model from the DeBERTa repo config and load the raw checkpoint,
# which still contains the lm_predictions.* / mask_predictions.* head weights.
model = ReplacedTokenDetectionModel.load_model(None, '/experiments/language_model/deberta_large.json')
model.load_state_dict(torch.load('pytorch_model.bin'))
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')

inputs = tokenizer('The Chinese sports delegation is the last to enter the 2002 Winter Olympics', return_tensors='pt')
out = model(input_ids=inputs['input_ids'], input_mask=inputs['attention_mask'],
            labels=torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]))
print(out['logits'])
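
For completeness, the two files that snippet expects can presumably be obtained as follows: the config json comes from the DeBERTa GitHub repo (experiments/language_model/), and pytorch_model.bin can be pulled from the Hub, where the raw checkpoint still carries the RTD head weights that Transformers discards. A sketch using huggingface_hub (the filename is assumed from the repo layout):

import torch
from huggingface_hub import hf_hub_download

# Download the raw checkpoint; unlike the Transformers-loaded model, the file
# itself still contains the lm_predictions.* / mask_predictions.* head weights.
ckpt_path = hf_hub_download(repo_id="microsoft/deberta-v3-large",
                            filename="pytorch_model.bin")

state_dict = torch.load(ckpt_path, map_location="cpu")
print([k for k in state_dict if k.startswith(("lm_predictions", "mask_predictions"))])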

Thanks for sharing, I will try it out!