Pre-training DeBERTaV2 - config questions

Hi everyone,

I would like to pre-train a DeBERTaV2 model. I have some questions about a few of its specific parameters and hope you can help me with them. In short: how do I reproduce the configuration that worked best in the paper, using the parameters that are available in the config?

The long version of the question is: can someone help me clarify some of the parameters and their default values? To get a good idea of what the configuration could/should look like, I loaded the microsoft/deberta-v2-xlarge model and had a look. This is its configuration:

DebertaV2Config {
  "_name_or_path": "microsoft/deberta-v2-xlarge",
  "attention_head_size": 64,
  "attention_probs_dropout_prob": 0.1,
  "conv_act": "gelu",
  "conv_kernel_size": 3,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "intermediate_size": 6144,
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 24,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 1536,
  "pos_att_type": [
  "position_biased_input": false,
  "position_buckets": 256,
  "relative_attention": true,
  "share_att_key": true,
  "transformers_version": "4.23.1",
  "type_vocab_size": 0,
  "vocab_size": 128100

The parameters I looked into are:

  • relative_attention: it is nice to see that this is ‘True’, as this is one of the big improvements of this model - no question here. However, it defaults to ‘False’ in the configuration - wouldn’t it be nice to have a default value that reproduces what they found to give the best performance?
  • pos_att_type: these are the new attention types. They showed in the paper that performance goes down if they do not use both, so it is good to see that both are in there - no question here. But again, if having the two attention types gives the best results, wouldn’t it be nice if that was the default in the model config?
  • max_relative_positions: I was surprised to read the description " The range of relative positions [-max_position_embeddings, max_position_embeddings] . Use the same value as max_position_embeddings ." and then see it set to -1. But after looking at the code, it looks like -1 makes it fall back to the recommended value.
  • position_biased_input: I am not sure exactly what this parameter controls… From the paper: “The BERT model incorporates absolute positions in the input layer. In DeBERTa, we incorporate them right after all the Transformer layers but before the softmax layer for masked token prediction”. So they incorporate the position at some point. Does this parameter make the model BERT-style if it is set to ‘True’? Then it would make sense that it is ‘False’ here. It also looks like the documentation is incorrect about it defaulting to ‘False’: the model initialization contains self.position_biased_input = getattr(config, "position_biased_input", True), and the config sets position_biased_input=True by default anyway, which is not what the actual pre-trained model uses.
  • Following on from the previous point: if position_biased_input does not control the Enhanced Mask Decoder part of the model, does that mean this part cannot be switched on/off by a parameter?
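
The two fallbacks described in the list above can be sketched in plain Python. This is only an illustration of the behaviour as described in this post, not the library code: the helper names resolve_max_relative_positions and resolve_position_biased_input are hypothetical, the "below 1" threshold is an assumption, and SimpleNamespace stands in for a real config object.

```python
from types import SimpleNamespace


def resolve_max_relative_positions(config):
    """Return the effective relative-position range for a config-like object."""
    value = getattr(config, "max_relative_positions", -1)
    # Assumption: values below 1 (such as the -1 in the dump above)
    # fall back to max_position_embeddings.
    if value < 1:
        value = config.max_position_embeddings
    return value


def resolve_position_biased_input(config):
    """Same getattr pattern as the line quoted above: a missing attribute means True."""
    return getattr(config, "position_biased_input", True)


# Values taken from the microsoft/deberta-v2-xlarge dump above.
checkpoint_like = SimpleNamespace(
    max_position_embeddings=512,
    max_relative_positions=-1,
    position_biased_input=False,
)
# A config that does not set either attribute explicitly.
bare = SimpleNamespace(max_position_embeddings=512)

print(resolve_max_relative_positions(checkpoint_like))  # 512
print(resolve_position_biased_input(checkpoint_like))   # False
print(resolve_position_biased_input(bare))              # True
```

The last line shows why the default matters for pre-training from scratch: a config that never mentions position_biased_input would end up with ‘True’, unlike the released checkpoint.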

There are also some parameters in here that are not mentioned in the documentation of the configuration:

  • share_att_key: I am not sure what exactly it does, but it looks like this needs to be ‘False’ for pos_att_type to have any effect.
  • position_buckets: it is not mentioned in the config documentation and defaults to -1; as far as I can see from the code, bucketing only happens when it is >0, so by default it never uses bucketing?
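
To keep track of which of the settings discussed above would have to be set explicitly when building a config from scratch, a small comparison like the following may help. It is a sketch using plain dictionaries: the checkpoint values are copied from the dump above, the defaults are only the ones mentioned in this post (not re-checked against the library), and split_overrides is a hypothetical helper.

```python
# Values visible in the microsoft/deberta-v2-xlarge dump above.
checkpoint = {
    "relative_attention": True,
    "position_biased_input": False,
    "max_relative_positions": -1,
    "position_buckets": 256,
    "share_att_key": True,
}

# Defaults as described in this post; treat them as assumptions.
thread_defaults = {
    "relative_attention": False,
    "position_biased_input": True,
    "position_buckets": -1,
}


def split_overrides(checkpoint, defaults):
    """Split checkpoint settings into 'differs from default' and 'default unknown'."""
    differs = {
        key: value
        for key, value in checkpoint.items()
        if key in defaults and defaults[key] != value
    }
    unknown = {
        key: value for key, value in checkpoint.items() if key not in defaults
    }
    return differs, unknown


differs, unknown = split_overrides(checkpoint, thread_defaults)
print("must be set explicitly:", differs)
print("default not stated, set to be safe:", unknown)
```

Both dictionaries together could then be passed as keyword arguments when creating the config. Whether the undocumented keys (share_att_key, position_buckets) are picked up that way is an assumption worth verifying by printing the resulting config.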

Lastly, I have two questions regarding the tokenization. According to the model config and model card, the vocabulary is 128k. I could not find any information on why it is so large. Was there any benchmarking behind this, does anyone know? The second question is about what I see when I load the pretrained tokenizer for the model:

PreTrainedTokenizerFast(name_or_path='microsoft/deberta-v2-xlarge', vocab_size=1000, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

That says vocabulary size 1000. Is that mismatch between 128k and 1000 real?

Thank you all for your help.

The transformers implementation is not suitable for pretraining, so I’d recommend sticking to pretraining RoBERTa rather than DeBERTa. Alternatively, you could use the original repository, microsoft/DeBERTa on GitHub.

The v2 models are absurdly large, so it will be very slow and expensive to pretrain them at that size. DeBERTa is also about twice as slow as RoBERTa due to the attention changes, and its irregular mechanisms mean that most optimization libraries cannot accommodate it.

If you want to learn more about the model, I’d recommend going through the original repo. There are configurations in there, and you’ll have to piece together the story and make some assumptions. The model author is not responsive and some information is virtually unknown to the public at this point.

128k vocab is correct. Ignore what the tokenizer says. len(tokenizer) should be 128k. Like I said, the reason why it is that large is not clear. From v1 to v2, they switched from the GPT-2/RoBERTa BPE tokenizer to SentencePiece, but that doesn’t explain why it doubled in size.
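
A quick way to compare the numbers mentioned above is to look at the tokenizer's vocab_size attribute, len(tokenizer), and the config's vocab_size side by side. The sketch below assumes only that the tokenizer object exposes vocab_size and supports len(); vocab_report and StubTokenizer are hypothetical names, and the stub's 128,000 is a placeholder for the "128k" above, not a measured value.

```python
def vocab_report(tokenizer, config_vocab_size):
    """Collect the three vocabulary numbers and check they are compatible."""
    base = tokenizer.vocab_size  # the number shown in the tokenizer repr
    full = len(tokenizer)        # total number of tokens the tokenizer can emit
    return {
        "vocab_size_attr": base,
        "len": full,
        "config_vocab_size": config_vocab_size,
        # The embedding matrix must have a row for every token the tokenizer can emit.
        "fits_embedding": full <= config_vocab_size,
    }


class StubTokenizer:
    """Hypothetical stand-in so the sketch runs without downloading anything."""

    def __init__(self, vocab_size, total):
        self.vocab_size = vocab_size
        self._total = total

    def __len__(self):
        return self._total


# 1000 is the value from the repr above; 128_000 is a placeholder for "128k".
report = vocab_report(StubTokenizer(vocab_size=1000, total=128_000), 128_100)
print(report)
```

With the real tokenizer loaded through transformers, the same function can be called with that object and the config's vocab_size (128100 in the dump above) to see the actual numbers.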

Sorry there isn’t more info. Also, there is a DeBERTa v3 which is even better, but there are many unknowns about that as well.

Hi Nicholas,

thank you very much for all that information!

What brought me to the idea of pretraining my own DeBERTa was that I already (successfully) trained my own RoBERTa and thought maybe I can try another model that has some fancy new bells and whistles :smile: The tutorials for putting together the training of a custom RoBERTa are really good, so I thought maybe I can ‘just’ switch out the model part.

Is there a list of the transformer models that are suitable for pretraining?