How to understand the bias term in the language model head (when we tie the word embeddings)?

I was reading through the masked language modeling code in Hugging Face Transformers, and I have a question about the language model head.

Here is the final linear layer, where we project from the hidden size to the vocabulary size (transformers/modeling_bert.py at f2fbe4475386bfcfb3b83d0a3223ba216a3c3a91 · huggingface/transformers · GitHub):

    # projection from hidden size to vocab size, with no built-in bias
    self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
    # a separate bias with one entry per vocabulary token, initialized to zeros
    self.bias = nn.Parameter(torch.zeros(config.vocab_size))
    # attach the bias to the decoder so it is added in the forward pass
    self.decoder.bias = self.bias

The bias term is initialized to a zero vector here. Later, when the weights are initialized, the weight of this linear layer is tied to the word embedding matrix.

But we don't do anything like that for the bias term. How should I understand this, and why do we want to initialize the bias as a zero vector?
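
Here is a minimal standalone sketch of how I currently read the tying (plain PyTorch, not the actual library code; the sizes are just BERT-base-like numbers I picked for illustration): the decoder weight ends up sharing storage with the input embedding matrix, while the bias stays a separate, independently learned parameter.

    import torch
    import torch.nn as nn

    vocab_size, hidden_size = 30522, 768   # BERT-base-like sizes, for illustration only

    embedding = nn.Embedding(vocab_size, hidden_size)          # input word embeddings
    decoder = nn.Linear(hidden_size, vocab_size, bias=False)   # LM head projection
    bias = nn.Parameter(torch.zeros(vocab_size))               # zero-initialized bias
    decoder.bias = bias                                        # attach it to the decoder

    # "Tying" makes the decoder weight point at the embedding weight,
    # so the two share storage and are updated together; the bias is left alone.
    decoder.weight = embedding.weight
    print(decoder.weight.data_ptr() == embedding.weight.data_ptr())  # True

    hidden_states = torch.randn(2, 5, hidden_size)   # (batch, seq_len, hidden)
    logits = decoder(hidden_states)                  # hidden_states @ E.T + bias
    print(logits.shape)                              # torch.Size([2, 5, 30522])

For reference, this is the `tie_weights` method in the library: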

    def tie_weights(self):
        """
        Tie the weights between the input embeddings and the output embeddings.
        If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning the
        weights instead.
        """
        if getattr(self.config, "tie_word_embeddings", True):
            output_embeddings = self.get_output_embeddings()
            if output_embeddings is not None:
                self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())

        if getattr(self.config, "is_encoder_decoder", False) and getattr(self.config, "tie_encoder_decoder", False):
            if hasattr(self, self.base_model_prefix):
                self = getattr(self, self.base_model_prefix)
            self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix)

        for module in self.modules():
            if hasattr(module, "_tie_weights"):
                module._tie_weights()
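
To check my reading, this is the kind of quick test I have in mind (assuming the `transformers` library and the `bert-base-uncased` checkpoint are available): after loading, the input and output embedding weights share storage, while the bias is its own separate parameter.

    from transformers import BertForMaskedLM

    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    in_emb = model.get_input_embeddings()    # bert.embeddings.word_embeddings
    out_emb = model.get_output_embeddings()  # cls.predictions.decoder

    print(in_emb.weight.data_ptr() == out_emb.weight.data_ptr())  # True: the weights are tied
    print(out_emb.bias.shape)                                     # torch.Size([30522]): separate bias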

My understanding:

  1. The final linear layer receives hidden representations that have been transformed by several feed-forward layers, so they may not line up with the embedding vectors exactly; the bias term gives the head some extra freedom to compensate for that mismatch (see the sketch below).
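
To spell out why I think the bias can play that role (a toy sketch with random tensors, not real model weights): the score for each vocabulary token decomposes into a similarity term plus a per-token offset that does not depend on the hidden state at all, which is what would let the bias compensate for a systematic mismatch between the hidden representations and the embedding vectors.

    import torch

    hidden = torch.randn(768)       # hidden state at one masked position
    E = torch.randn(30522, 768)     # tied embedding / decoder weight
    b = torch.zeros(30522)          # the separate bias, initially all zeros

    logits = E @ hidden + b         # logits[i] = <E[i], hidden> + b[i]
    print(logits.shape)             # torch.Size([30522])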

As I'm not sure my understanding is accurate, I would like to seek your opinions.