How to understand the bias term in the language model head (when we tie the word embeddings)?

I was reading through the masked language modeling code in Hugging Face Transformers, and I have a question about the language model head.

Here is the final linear layer, where we project from the hidden size to the vocabulary size (transformers/modeling_bert.py at f2fbe4475386bfcfb3b83d0a3223ba216a3c3a91 · huggingface/transformers · GitHub):

    # projection from hidden size to vocab size, with no built-in bias
    self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
    # a separate bias with one entry per vocabulary token, initialized to zeros
    self.bias = nn.Parameter(torch.zeros(config.vocab_size))
    # attach the bias to the decoder so it is added in the forward pass
    self.decoder.bias = self.bias

The bias term is initialized to a zero vector here. Later, when the weights are initialized, the weight of this linear layer is tied to the word embedding matrix.

But we don't do anything like that for the bias term. How should I understand this, and why do we want to initialize the bias as a zero vector?
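
Here is a minimal standalone sketch of how I currently read the tying (plain PyTorch, not the actual library code; the sizes are just BERT-base-like numbers I picked for illustration): the decoder weight ends up sharing storage with the input embedding matrix, while the bias stays a separate, independently learned parameter.

    import torch
    import torch.nn as nn

    vocab_size, hidden_size = 30522, 768   # BERT-base-like sizes, for illustration only

    embedding = nn.Embedding(vocab_size, hidden_size)          # input word embeddings
    decoder = nn.Linear(hidden_size, vocab_size, bias=False)   # LM head projection
    bias = nn.Parameter(torch.zeros(vocab_size))               # zero-initialized bias
    decoder.bias = bias                                        # attach it to the decoder

    # "Tying" makes the decoder weight point at the embedding weight,
    # so the two share storage and are updated together; the bias is left alone.
    decoder.weight = embedding.weight
    print(decoder.weight.data_ptr() == embedding.weight.data_ptr())  # True

    hidden_states = torch.randn(2, 5, hidden_size)   # (batch, seq_len, hidden)
    logits = decoder(hidden_states)                  # hidden_states @ E.T + bias
    print(logits.shape)                              # torch.Size([2, 5, 30522])

For reference, this is the `tie_weights` method in the library: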

    def tie_weights(self):
        """
        Tie the weights between the input embeddings and the output embeddings.
        If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning the
        weights instead.
        """
        if getattr(self.config, "tie_word_embeddings", True):
            output_embeddings = self.get_output_embeddings()
            if output_embeddings is not None:
                self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())

        if getattr(self.config, "is_encoder_decoder", False) and getattr(self.config, "tie_encoder_decoder", False):
            if hasattr(self, self.base_model_prefix):
                self = getattr(self, self.base_model_prefix)
            self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix)

        for module in self.modules():
            if hasattr(module, "_tie_weights"):
                module._tie_weights()
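
To check my reading, this is the kind of quick test I have in mind (assuming the `transformers` library and the `bert-base-uncased` checkpoint are available): after loading, the input and output embedding weights share storage, while the bias is its own separate parameter.

    from transformers import BertForMaskedLM

    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    in_emb = model.get_input_embeddings()    # bert.embeddings.word_embeddings
    out_emb = model.get_output_embeddings()  # cls.predictions.decoder

    print(in_emb.weight.data_ptr() == out_emb.weight.data_ptr())  # True: the weights are tied
    print(out_emb.bias.shape)                                     # torch.Size([30522]): separate bias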

My understanding:

  1. The final linear layer receives hidden representations that have been transformed by several feed-forward layers, so they may not line up with the embedding vectors exactly; the bias term gives the head some extra freedom to compensate for that mismatch (see the sketch below).
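
To spell out why I think the bias can play that role (a toy sketch with random tensors, not real model weights): the score for each vocabulary token decomposes into a similarity term plus a per-token offset that does not depend on the hidden state at all, which is what would let the bias compensate for a systematic mismatch between the hidden representations and the embedding vectors.

    import torch

    hidden = torch.randn(768)       # hidden state at one masked position
    E = torch.randn(30522, 768)     # tied embedding / decoder weight
    b = torch.zeros(30522)          # the separate bias, initially all zeros

    logits = E @ hidden + b         # logits[i] = <E[i], hidden> + b[i]
    print(logits.shape)             # torch.Size([30522])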

As I'm not sure my understanding is accurate, I would like to seek your opinions.