DeBERTa-v3: How to keep the ELECTRA-style task head?

While DeBERTa-v2 was trained with masked language modelling (MLM), DeBERTa-v3 is an improved version pre-trained with the ELECTRA pre-training task (replaced token detection, RTD), which seems to be much more efficient than MLM. (see here:

I want to load DeBERTa-v3 with its original RTD task head. When I load it with the code below, however, Transformers always seems to delete the model head.

from transformers import AutoModelForPreTraining, AutoTokenizer

model_name = "microsoft/deberta-v3-base"
discriminator = AutoModelForPreTraining.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# resulting warning:
Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForMaskedLM: ['lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.LayerNorm.bias', 'deberta.embeddings.position_embeddings.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight']
- This IS expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForMaskedLM were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

=> Question: is there some way to prevent Transformers from deleting the task-head?
I imagine that this is due to the fact that the DeBERTa config is for DeBERTa-v2 (with MLM head) but there is no config for v3.
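
To make the mismatch easier to see, here is a small sketch that groups the parameter names from the warning by their top-level module. The key lists are copied from the warning above; the helper `group_by_prefix` is just a hypothetical name for illustration. Judging by the names alone (an assumption on my side), `mask_predictions` would be the RTD head and `lm_predictions` the generator's MLM head, while `DebertaV2ForMaskedLM` expects a head called `cls`.

```python
from collections import defaultdict

# Copied from the "were not used" part of the warning.
UNUSED_KEYS = [
    "lm_predictions.lm_head.dense.bias",
    "mask_predictions.dense.weight",
    "lm_predictions.lm_head.dense.weight",
    "lm_predictions.lm_head.LayerNorm.bias",
    "mask_predictions.LayerNorm.weight",
    "mask_predictions.classifier.bias",
    "lm_predictions.lm_head.bias",
    "mask_predictions.LayerNorm.bias",
    "deberta.embeddings.position_embeddings.weight",
    "mask_predictions.classifier.weight",
    "mask_predictions.dense.bias",
    "lm_predictions.lm_head.LayerNorm.weight",
]

# Copied from the "newly initialized" part of the warning.
NEW_KEYS = [
    "cls.predictions.decoder.weight",
    "cls.predictions.transform.LayerNorm.bias",
    "cls.predictions.transform.dense.bias",
    "cls.predictions.bias",
    "cls.predictions.transform.dense.weight",
    "cls.predictions.transform.LayerNorm.weight",
]


def group_by_prefix(keys):
    """Group parameter names by their top-level module name (hypothetical helper)."""
    groups = defaultdict(list)
    for key in keys:
        prefix = key.split(".", 1)[0]
        groups[prefix].append(key)
    return {prefix: sorted(names) for prefix, names in groups.items()}


unused = group_by_prefix(UNUSED_KEYS)
new = group_by_prefix(NEW_KEYS)

for prefix, names in sorted(unused.items()):
    print(f"in checkpoint, dropped: {prefix} ({len(names)} tensors)")
for prefix, names in sorted(new.items()):
    print(f"expected by class, re-initialized: {prefix} ({len(names)} tensors)")
```

So the head weights do seem to be in the checkpoint; they are just stored under module names that the loaded class does not have.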

I don’t have the same issue if I import ELECTRA with the code below. With ELECTRA, I can use the RTD head without a problem.

from transformers import ElectraForPreTraining, ElectraTokenizerFast
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")  
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
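
For context, this is roughly what I mean by "using the RTD head": the discriminator returns one logit per token, and a positive logit (sigmoid above 0.5) means the token is predicted to be replaced. A minimal sketch in plain Python, assuming the logits have already been pulled out of the model output as a list of floats; the logit values below are made up and `flag_replaced` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def flag_replaced(logits, threshold=0.5):
    """Return True for every token whose replacement probability exceeds the threshold."""
    return [sigmoid(logit) > threshold for logit in logits]


# Made-up per-token logits, for illustration only.
example_logits = [-3.2, -1.5, 4.1, -2.7]
print(flag_replaced(example_logits))
```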

I found the RTD model in the GitHub repository: “DeBERTa/ at master · microsoft/DeBERTa · GitHub”

import torch
from DeBERTa.apps.models.replaced_token_detection_model import ReplacedTokenDetectionModel
from transformers import AutoTokenizer

# Load the RTD model using the config file from the DeBERTa repo
model = ReplacedTokenDetectionModel.load_model(None, '/experiments/language_model/deberta_large.json')
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')

inputs = tokenizer('The Chinese sports delegation is the last to enter the 2002 Winter Olympics', return_tensors='pt')
# One label per token (presumably 1 = replaced token, 0 = original token)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
out = model(input_ids=inputs['input_ids'], input_mask=inputs['attention_mask'], labels=labels)
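
The labels need one entry per input token. A minimal sketch of how such a vector could be built by comparing the original token ids with a corrupted copy, assuming the usual ELECTRA convention that 1 marks a replaced token (I have not verified that `ReplacedTokenDetectionModel` uses the same convention). The token ids below are made up and `rtd_labels` is a hypothetical helper name.

```python
def rtd_labels(original_ids, corrupted_ids):
    """Return 1 where the corrupted token differs from the original, else 0."""
    if len(original_ids) != len(corrupted_ids):
        raise ValueError("both sequences must have the same length")
    return [int(o != c) for o, c in zip(original_ids, corrupted_ids)]


# Made-up token ids, for illustration only.
original = [1, 10, 11, 12, 13, 2]
corrupted = [1, 10, 99, 12, 13, 2]  # the third token was swapped

labels = rtd_labels(original, corrupted)
print(labels)
```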

Thanks for sharing, I will try it out!

I tried to run the code proposed by @WENGSYX, but it doesn’t work for me. Which pytorch_model.bin are you loading, and from where? It would be great if you could post the working code, including the steps for installing DeBERTa and downloading pytorch_model.bin.

I just noticed that one of the authors is also on the forum, @DeBERTa. I really like your model and there is great research that could be done with it, but having the raw discriminator and generator would be necessary. Could you please share the models and/or code to use them?
See also the GitHub issue here: Sharing DeBERTa-v3 discriminator and generator with task-specific heads? · Issue #89 · microsoft/DeBERTa · GitHub

For example, there is great recent research using the ELECTRA objective for few-shot learning, but it all uses the original ELECTRA instead of DeBERTa-v3. Your model could be of great value there if the raw models were shared:
