Hi everyone,
I would like to pre-train a DeBERTaV2 model and have some questions about specific parameters that I hope you can help me with. In short: how do I reproduce, via the parameters available in the config, the configuration they found to work best in the paper?
The long version: can someone help me clarify some of the parameters and their default values? To get an idea of what the configuration could/should look like, I loaded the microsoft/deberta-v2-xlarge
model and had a look. This is its configuration:
DebertaV2Config {
"_name_or_path": "microsoft/deberta-v2-xlarge",
"attention_head_size": 64,
"attention_probs_dropout_prob": 0.1,
"conv_act": "gelu",
"conv_kernel_size": 3,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1536,
"initializer_range": 0.02,
"intermediate_size": 6144,
"layer_norm_eps": 1e-07,
"max_position_embeddings": 512,
"max_relative_positions": -1,
"model_type": "deberta-v2",
"norm_rel_ebd": "layer_norm",
"num_attention_heads": 24,
"num_hidden_layers": 24,
"pad_token_id": 0,
"pooler_dropout": 0,
"pooler_hidden_act": "gelu",
"pooler_hidden_size": 1536,
"pos_att_type": [
"p2c",
"c2p"
],
"position_biased_input": false,
"position_buckets": 256,
"relative_attention": true,
"share_att_key": true,
"transformers_version": "4.23.1",
"type_vocab_size": 0,
"vocab_size": 128100
}
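(For reference, the dump above is just the output of loading the checkpoint's config; a minimal sketch, assuming transformers is installed:)

```python
from transformers import AutoConfig

# Load and print the configuration of the published checkpoint;
# the JSON dump above is simply this output.
config = AutoConfig.from_pretrained("microsoft/deberta-v2-xlarge")
print(config)
```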
The parameters I looked into are:
- relative_attention: it is nice to see that this is True, as it is one of the big improvements of this model - no question here. However, it defaults to False in the configuration - wouldn't it be nice to have a default value that reproduces what they found to give the best performance? (For how these fields could be set explicitly, see the sketch after this list.)
- pos_att_type: these are the new disentangled attention types; the paper shows that performance drops if both are not used, so it is good to see they are both in there - no question here. But again, if having both attention types was shown to give the best results, wouldn't it be nice if that were the default in the model config?
- max_relative_positions: I was surprised to read the description "The range of relative positions [-max_position_embeddings, max_position_embeddings]. Use the same value as max_position_embeddings." and then see it set to -1. But after looking at the code, it seems -1 makes it fall back to the recommended value, so that is fine.
- position_biased_input: I am not sure exactly what this parameter controls. From the paper: "The BERT model incorporates absolute positions in the input layer. In DeBERTa, we incorporate them right after all the Transformer layers but before the softmax layer for masked token prediction". So they incorporate the absolute positions at some point. Does this parameter give the option to make it BERT-style if it is set to True? Then it would make sense that it is False here. It also looks like the documentation is incorrect about it defaulting to False: this is from the model initialization:
self.position_biased_input = getattr(config, "position_biased_input", True)
, and in the config signature it defaults to True (position_biased_input=True) anyway - which does not match what the actual pre-trained model uses?
- Following on from the previous point: if position_biased_input does not control the Enhanced Mask Decoder part of the model, is there any parameter that switches that part on and off?
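To make the defaults question concrete, here is a minimal sketch of how I would set these fields explicitly when building a config for pre-training. All values are copied from the xlarge checkpoint's config shown above; I have not verified them against the original training setup:

```python
from transformers import DebertaV2Config, DebertaV2ForMaskedLM

# Sketch: set the relevant fields explicitly instead of relying on the
# library defaults. Values are taken from microsoft/deberta-v2-xlarge.
config = DebertaV2Config(
    hidden_size=1536,
    intermediate_size=6144,
    num_hidden_layers=24,
    num_attention_heads=24,
    max_position_embeddings=512,
    relative_attention=True,          # defaults to False in the library
    pos_att_type=["p2c", "c2p"],      # both disentangled attention types
    position_biased_input=False,      # no absolute positions in the input layer
    max_relative_positions=-1,        # -1 falls back to max_position_embeddings
    position_buckets=256,             # extra attribute, not in the documented signature
    share_att_key=True,               # extra attribute, not in the documented signature
    norm_rel_ebd="layer_norm",        # extra attribute, not in the documented signature
    vocab_size=128100,
    type_vocab_size=0,
)
model = DebertaV2ForMaskedLM(config)  # model head for MLM-style pre-training
```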
There are also some parameters in here that are not mentioned in the documentation of the configuration:
- share_att_key: I am not sure what exactly it does, but it looks like this needs to be False so that pos_att_type has any effect.
- position_buckets: this one is not in the documented config and defaults to -1; as far as I can see from the code, bucketing only happens when it is > 0, so with the default the model never uses bucketing?
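A quick way to see which fields in the checkpoint differ from the library defaults (and which ones, like share_att_key and position_buckets, only exist as extra attributes) is to diff the two configs; a rough sketch:

```python
from transformers import AutoConfig, DebertaV2Config

# Sketch: compare the checkpoint's config with the library defaults to see
# which fields are non-default and which never appear in the documented
# DebertaV2Config signature (e.g. share_att_key, position_buckets).
pretrained = AutoConfig.from_pretrained("microsoft/deberta-v2-xlarge").to_dict()
default = DebertaV2Config().to_dict()

for key in sorted(set(pretrained) | set(default)):
    if pretrained.get(key) != default.get(key):
        print(f"{key}: default={default.get(key)!r} -> checkpoint={pretrained.get(key)!r}")
```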
Lastly, I have two questions regarding tokenization. According to the model config and model card, the vocabulary size is 128k. I could not find any information on why it is so large - was there any benchmarking behind this, does anyone know? The second question concerns what I see when I load the pretrained tokenizer for the model:
PreTrainedTokenizerFast(name_or_path="microsoft/deberta-v2-xlarge", vocab_size=1000, model_max_len=512, is_fast=True, padding_side="right", truncation_side="right", special_tokens={"bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"})
That says vocabulary size 1000. Is that mismatch between 128k and 1000 real?
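In case it helps, this is how I would check what the tokenizers actually report rather than trusting the repr (the slow SentencePiece tokenizer needs the sentencepiece package installed):

```python
from transformers import AutoTokenizer

# Sketch: compare the fast and slow tokenizers' reported vocabulary sizes.
tok_fast = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")
tok_slow = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge", use_fast=False)

print(tok_fast.vocab_size, len(tok_fast))  # vocab_size attribute vs. full vocab length
print(tok_slow.vocab_size, len(tok_slow))
```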
Thank you all for your help.