DeBERTa-v3: How to keep ELECTRA-style task-head?

While DeBERTa-v2 was trained with masked language modelling (MLM), DeBERTa-v3 is an improved version pre-trained with the ELECTRA-style replaced token detection (RTD) task, which seems to be much more efficient than MLM (see the paper: https://arxiv.org/pdf/2111.09543.pdf).
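
For intuition, RTD turns pre-training into per-token binary classification: a small generator corrupts some tokens, and the discriminator predicts for every position whether the token is original or replaced. A minimal sketch of that objective with hypothetical tensors (not the actual DeBERTa-v3 training code):

import torch
import torch.nn as nn

# Hypothetical discriminator outputs: one logit per token position
# (batch_size=2, seq_len=5), as produced by the RTD head on top of the encoder.
rtd_logits = torch.randn(2, 5)

# Labels: 1 where the generator replaced the original token, 0 otherwise.
rtd_labels = torch.tensor([[0., 0., 1., 0., 0.],
                           [0., 1., 0., 0., 1.]])

# The RTD loss is binary cross-entropy over all positions, which is part of why
# it is more efficient than MLM: every token contributes a training signal,
# not only the ~15% that are masked.
loss = nn.BCEWithLogitsLoss()(rtd_logits, rtd_labels)
print(loss)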

I want to load DeBERTa-v3 with its original RTD task head. When I load it with the code below, however, Transformers always seems to delete the model head.

from transformers import AutoModelForPreTraining, AutoTokenizer
model_name = "microsoft/deberta-v3-base"
discriminator = AutoModelForPreTraining.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# resulting warning:
Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForMaskedLM: ['lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.LayerNorm.bias', 'deberta.embeddings.position_embeddings.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight']
- This IS expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForMaskedLM were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

=> Question: is there some way to prevent Transformers from deleting the task-head?
I imagine this is because the DeBERTa-v3 checkpoint is loaded through the DeBERTa-v2 architecture, whose AutoModelForPreTraining mapping is DebertaV2ForMaskedLM (an MLM head); since Transformers has no RTD head class for v3, the lm_predictions.* and mask_predictions.* weights in the checkpoint are dropped.
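
One way to see exactly what gets dropped (this doesn't keep the head, it only makes the mismatch visible) is to ask from_pretrained for its loading info; output_loading_info is a standard Transformers kwarg, the rest is just a sketch:

from transformers import AutoModelForPreTraining

# Returns the model plus a dict listing which checkpoint keys were ignored
# and which model weights had to be freshly initialized.
model, loading_info = AutoModelForPreTraining.from_pretrained(
    "microsoft/deberta-v3-base", output_loading_info=True
)

# The RTD head lives in the unused keys (lm_predictions.*, mask_predictions.*);
# no DeBERTa-v2/v3 class in Transformers consumes them, so they are discarded.
print(loading_info["unexpected_keys"])
print(loading_info["missing_keys"])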

I don’t have the same issue when I load ELECTRA with the code below; its RTD head works without a problem.

from transformers import ElectraForPreTraining, ElectraTokenizerFast
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")  
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
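
For reference, this is roughly how the ELECTRA RTD head can be queried (a sketch along the lines of the google/electra-small-discriminator model card; the example sentence is mine, and a positive logit means "replaced"):

import torch

# One token in this sentence has been swapped for a plausible replacement.
fake_sentence = "The quick brown fox fake over the lazy dog"

inputs = tokenizer(fake_sentence, return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits  # shape (1, seq_len): one logit per token

# Positive logit => the discriminator thinks the token was replaced.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, logit in zip(tokens, logits[0]):
    print(f"{token}: {'replaced' if logit > 0 else 'original'}")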

I found the RTD model in the official repo on GitHub: DeBERTa/apps/models/replaced_token_detection_model.py (microsoft/DeBERTa).

import torch
from DeBERTa.apps.models.replaced_token_detection_model import ReplacedTokenDetectionModel
from transformers import AutoTokenizer

# Build the RTD model from the DeBERTa repo config and load the raw checkpoint,
# which still contains the lm_predictions.* / mask_predictions.* head weights.
model = ReplacedTokenDetectionModel.load_model(None, '/experiments/language_model/deberta_large.json')
model.load_state_dict(torch.load('pytorch_model.bin'))
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')

inputs = tokenizer('The Chinese sports delegation is the last to enter the 2002 Winter Olympics', return_tensors='pt')
out = model(input_ids=inputs['input_ids'], input_mask=inputs['attention_mask'],
            labels=torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]))
print(out['logits'])
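
For completeness, the two files that snippet expects can presumably be obtained as follows: the config json comes from the DeBERTa GitHub repo (experiments/language_model/), and pytorch_model.bin can be pulled from the Hub, where the raw checkpoint still carries the RTD head weights that Transformers discards. A sketch using huggingface_hub (the filename is assumed from the repo layout):

import torch
from huggingface_hub import hf_hub_download

# Download the raw checkpoint; unlike the Transformers-loaded model, the file
# itself still contains the lm_predictions.* / mask_predictions.* head weights.
ckpt_path = hf_hub_download(repo_id="microsoft/deberta-v3-large",
                            filename="pytorch_model.bin")

state_dict = torch.load(ckpt_path, map_location="cpu")
print([k for k in state_dict if k.startswith(("lm_predictions", "mask_predictions"))])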

Thanks for sharing, I will try it out!