BERT embeddings for padding token not 0?

Hi.

Backstory: I wanted to visualize some static BERT embeddings (taken before the first transformer block) and was wondering whether I should average them. But then, what about differently sized inputs? I suspected that the embeddings for the padding token would be zero, so I could simply average over all positions.

The Problem: While this seems to be the case for a plain-vanilla PyTorch embedding, for BERT it is not: there the padding embeddings are not zero. They start out as an ordinary PyTorch nn.Embedding (with padding_idx=0), but end up with non-zero values. Examples below.

I can live with that, since I found a way to work around it, but I was wondering whether there is something (subtle) to be learned here. Why is the embedding not zero? How can backprop reach embeddings that should be masked out and not accessible anyway?

Plain-vanilla PyTorch:

import torch

emb = torch.nn.Embedding(5, 3, padding_idx=0)  # Embedding(5, 3, padding_idx=0)
inp = torch.tensor([1, 2, 3, 0, 0, 0, 0])      # 0 is the padding index
emb(inp)
=> 
tensor([[-0.9628,  0.4631, -0.1923],
        [ 0.7668,  0.0380, -1.1776],
        [ 0.0938,  0.9070,  0.5080],
        [ 0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000]], grad_fn=<EmbeddingBackward0>)

BERT:

from transformers import AutoModelForSequenceClassification

bert = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
bert.bert.embeddings.word_embeddings  # Embedding(30522, 768, padding_idx=0)
inp = torch.tensor([  101,  2339,  2079,  2111,  2224,  1012,  4372,  2615,  5371,  2006,
         8241,  1029,  2339,  2079,  2111,  2404,  1037,  1012,  4372,  2615,
         5371,  2000,  3573,  2035,  2037,  7800,  1999,  1037,  8241,  1029,
         2065,  2619, 20578,  2015,  2009,  1025,  3475,  1005,  1056,  1996,
         1012,  4372,  2615,  8053,  7801,  2004,  2035,  1996,  2060,  6764,
         1029,  4283,   999,   102,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0], dtype=torch.long)
bert.bert.embeddings.word_embeddings(inp)
=> 
tensor([[ 0.0138, -0.0260, -0.0237,  ...,  0.0090,  0.0067,  0.0144],
        [ 0.0214,  0.0150, -0.0675,  ...,  0.0120,  0.0240,  0.0203],
        [ 0.0066, -0.0536,  0.0063,  ..., -0.0139, -0.0531, -0.0086],
        ...,
        [-0.0102, -0.0614, -0.0264,  ..., -0.0198, -0.0371, -0.0097],
        [-0.0102, -0.0614, -0.0264,  ..., -0.0198, -0.0371, -0.0097],
        [-0.0102, -0.0614, -0.0264,  ..., -0.0198, -0.0371, -0.0097]],
       grad_fn=<EmbeddingBackward0>)
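
For reference, a minimal sketch of one way to handle the averaging anyway, masking out the padding positions first (it reuses bert and inp from above; just an illustration of the idea, not necessarily the cleanest approach):

# Average the static word embeddings over non-padding positions only.
emb_out = bert.bert.embeddings.word_embeddings(inp)       # (seq_len, 768)
mask = (inp != 0).unsqueeze(-1).float()                   # 1.0 for real tokens, 0.0 for [PAD]
mean_emb = (emb_out * mask).sum(dim=0) / mask.sum(dim=0)  # (768,) mean over real tokens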

Hello! :wave:

I think what’s happening is weight tying. If you create a new model from the bert-base-uncased config and run the same code you ran on its bert.embeddings.word_embeddings, you will get zeros where there are padding token indices.
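
For example, a quick sketch of that check (my own snippet; it builds the model from the config alone, so no pre-trained weights are loaded):

from transformers import AutoConfig, AutoModelForSequenceClassification
import torch

config = AutoConfig.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_config(config)  # random init, no checkpoint loaded

pad_row = model.bert.embeddings.word_embeddings.weight[0]       # padding_idx is 0
print(torch.all(pad_row == 0))                                  # tensor(True): freshly zeroed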

However, as you saw, this no longer holds once the pre-trained bert-base-uncased weights are loaded. I debugged the process of BertForSequenceClassification.from_pretrained("bert-base-uncased") and saw that in the saved checkpoint the weights of the nn.Embedding layer and of the classifier are already the same.

But if Hugging Face pre-trained the model with the code in the Transformers library, then BertForPreTraining has tied weights (check the post_init() method of the PreTrainedModel class), so it makes sense that the nn.Embedding layer ends up with the same weights as the classifier.
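
A quick sketch to check this (my snippet; it loads BertForPreTraining, where the tied head lives, and uses the library's get_input_embeddings()/get_output_embeddings() helpers):

from transformers import BertForPreTraining
import torch

model = BertForPreTraining.from_pretrained("bert-base-uncased")

# The pre-training head's decoder and the word embeddings are the same Parameter object (tied).
print(model.get_output_embeddings().weight is model.get_input_embeddings().weight)  # True

# And in the saved checkpoint, the padding row (index 0) is no longer all zeros.
print(torch.all(model.get_input_embeddings().weight[0] == 0))                       # tensor(False)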


Hi Ben.

Thanks for investing the time and chiming in.
I did not understand you fully. Could you please elaborate and walk me through this?

Weight tying as in using the same embeddings in multiple places? Which places are these?

Following BertForPreTraining's call to post_init() down into the weight initialization, I find the following:

elif isinstance(module, nn.Embedding):
    module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
    if module.padding_idx is not None:
        module.weight.data[module.padding_idx].zero_()

Are you referring to this? I don’t see the tying there. Or do you mean something else? Since module.padding_idx is 0 (and therefore not None), the padding embedding should be zeroed.

Weight tying means the classifier weights reference the embedding weights, making them the exact same weight tensor: if one changes, the other changes as well, because they literally share the same memory. This is done both to save parameters and because research shows it improves model performance (see Weight Tying Explained | Papers With Code).
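
A tiny standalone sketch of what that means in PyTorch terms (nothing BERT-specific):

import torch

emb = torch.nn.Embedding(5, 3, padding_idx=0)  # "input" embeddings
dec = torch.nn.Linear(3, 5, bias=False)        # "output" projection over the same vocab

dec.weight = emb.weight                        # tie: both modules now hold the same Parameter

dec.weight.data[0] += 1.0                      # modify it through the decoder...
print(emb.weight[0])                           # ...and the embedding's padding row changed too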

Regarding the code, let’s take a walk through it:

In modeling_bert.py we see the following:

class BertForPreTraining(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        self.bert = BertModel(config)
        self.cls = BertPreTrainingHeads(config)

        # Initialize weights and apply final processing
        self.post_init()

BertForPreTraining inherits from BertPreTrainedModel, which in turn inherits from PreTrainedModel in modeling_utils.py. That is where post_init() is defined, and it looks like this:

def post_init(self):
    """
    A method executed at the end of each Transformer model initialization, to execute code that needs the model's
    modules properly initialized (such as weight initialization).
    """
    self.init_weights()
    self._backward_compatibility_gradient_checkpointing()

Now, let’s look at self.init_weights():

def init_weights(self):
    """
    If needed prunes and maybe initializes weights.
    """
    # Prune heads if needed
    if self.config.pruned_heads:
        self.prune_heads(self.config.pruned_heads)

    if _init_weights:
        # Initialize weights
        self.apply(self._init_weights)

        # Tie weights should be skipped when not initializing all weights
        # since from_pretrained(...) calls tie weights anyways
        self.tie_weights()

And finally, self.tie_weights() eventually does output_embeddings.weight = input_embeddings.weight.
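
To see why that matters for the padding row: the tied matrix is also the output projection, which produces a logit for every vocabulary id (padding included) at every position, so it receives gradients on the output side even though the input-side lookup of padding_idx is excluded. A toy sketch (same idea as the tied pair above, nothing BERT-specific):

import torch

emb = torch.nn.Embedding(5, 3, padding_idx=0)
dec = torch.nn.Linear(3, 5, bias=False)
dec.weight = emb.weight                                # tied, like tie_weights() does

inp = torch.tensor([1, 2, 3])                          # the padding id never appears in the input
logits = dec(emb(inp))                                 # (3, 5): a score for every vocab id, incl. 0
loss = torch.nn.CrossEntropyLoss()(logits, torch.tensor([2, 3, 4]))
loss.backward()

print(emb.weight.grad[0])                              # non-zero: the padding row still gets a gradient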


OK, so the input and output embeddings are tied during pre-training, and because the tied matrix is also used to produce the output logits (which include a score for the padding token at every position), backprop ends up adjusting the padding row as well.

Thanks for your time, Ben.
