BERT embeddings for padding token not 0?

Hi.

Backstory: I wanted to visualize some static BERT embeddings (taken before the first transformer block) and was wondering whether I should average them. But then, what about differently sized inputs? I suspected that the embeddings for the padding token would be zero, so I could simply average over all positions.

The Problem: While this seems to be the case for a plain-vanilla PyTorch embedding, for BERT it is not: there the padding embeddings are not zero. They start out as an ordinary PyTorch nn.Embedding (with padding_idx=0), but end up with non-zero values. Examples below.

I can live with that, since I found a way to work around it, but I was wondering whether there is something (subtle) to be learned here. Why is the embedding not zero? How can backprop reach embeddings that should be masked out and not accessible anyway?

Plain-vanilla PyTorch:

import torch

emb = torch.nn.Embedding(5, 3, padding_idx=0)  # Embedding(5, 3, padding_idx=0)
inp = torch.tensor([1, 2, 3, 0, 0, 0, 0])      # 0 is the padding index
emb(inp)
=> 
tensor([[-0.9628,  0.4631, -0.1923],
        [ 0.7668,  0.0380, -1.1776],
        [ 0.0938,  0.9070,  0.5080],
        [ 0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000]], grad_fn=<EmbeddingBackward0>)

BERT:

from transformers import AutoModelForSequenceClassification

bert = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
bert.bert.embeddings.word_embeddings  # Embedding(30522, 768, padding_idx=0)
inp = torch.tensor([  101,  2339,  2079,  2111,  2224,  1012,  4372,  2615,  5371,  2006,
         8241,  1029,  2339,  2079,  2111,  2404,  1037,  1012,  4372,  2615,
         5371,  2000,  3573,  2035,  2037,  7800,  1999,  1037,  8241,  1029,
         2065,  2619, 20578,  2015,  2009,  1025,  3475,  1005,  1056,  1996,
         1012,  4372,  2615,  8053,  7801,  2004,  2035,  1996,  2060,  6764,
         1029,  4283,   999,   102,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0], dtype=torch.long)
bert.bert.embeddings.word_embeddings(inp)
=> 
tensor([[ 0.0138, -0.0260, -0.0237,  ...,  0.0090,  0.0067,  0.0144],
        [ 0.0214,  0.0150, -0.0675,  ...,  0.0120,  0.0240,  0.0203],
        [ 0.0066, -0.0536,  0.0063,  ..., -0.0139, -0.0531, -0.0086],
        ...,
        [-0.0102, -0.0614, -0.0264,  ..., -0.0198, -0.0371, -0.0097],
        [-0.0102, -0.0614, -0.0264,  ..., -0.0198, -0.0371, -0.0097],
        [-0.0102, -0.0614, -0.0264,  ..., -0.0198, -0.0371, -0.0097]],
       grad_fn=<EmbeddingBackward0>)
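
For reference, a minimal sketch of one way to handle the averaging anyway, masking out the padding positions first (it reuses bert and inp from above; just an illustration of the idea, not necessarily the cleanest approach):

# Average the static word embeddings over non-padding positions only.
emb_out = bert.bert.embeddings.word_embeddings(inp)       # (seq_len, 768)
mask = (inp != 0).unsqueeze(-1).float()                   # 1.0 for real tokens, 0.0 for [PAD]
mean_emb = (emb_out * mask).sum(dim=0) / mask.sum(dim=0)  # (768,) mean over real tokens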

Hello! :wave:

I think what’s happening is weight tying. If you create a new model from the bert-base-uncased config and run the same code you ran on its bert.embeddings.word_embeddings, you will get zeros where there are padding token indices.
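
For example, a quick sketch of that check (my own snippet; it builds the model from the config alone, so no pre-trained weights are loaded):

from transformers import AutoConfig, AutoModelForSequenceClassification
import torch

config = AutoConfig.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_config(config)  # random init, no checkpoint loaded

pad_row = model.bert.embeddings.word_embeddings.weight[0]       # padding_idx is 0
print(torch.all(pad_row == 0))                                  # tensor(True): freshly zeroed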

However, as you saw, this no longer holds once the pre-trained bert-base-uncased weights are loaded. I debugged the process of BertForSequenceClassification.from_pretrained("bert-base-uncased") and saw that in the saved checkpoint the weights of the nn.Embedding layer and of the classifier are already the same.

But if Hugging Face pre-trained the model with the code in the Transformers library, then BertForPreTraining has tied weights (check the post_init() method of the PreTrainedModel class), so it makes sense that the nn.Embedding layer ends up with the same weights as the classifier.
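
A quick sketch to check this (my snippet; it loads BertForPreTraining, where the tied head lives, and uses the library's get_input_embeddings()/get_output_embeddings() helpers):

from transformers import BertForPreTraining
import torch

model = BertForPreTraining.from_pretrained("bert-base-uncased")

# The pre-training head's decoder and the word embeddings are the same Parameter object (tied).
print(model.get_output_embeddings().weight is model.get_input_embeddings().weight)  # True

# And in the saved checkpoint, the padding row (index 0) is no longer all zeros.
print(torch.all(model.get_input_embeddings().weight[0] == 0))                       # tensor(False)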


Hi Ben.

Thanks for investing the time and chiming in.
I did not understand you fully. Could you please elaborate and walk me through this?

Weight tying as in using the same embeddings in multiple places? Which places are these?

Following BertForPreTraining's call to post_init() down into the weight initialization, I find the following:

elif isinstance(module, nn.Embedding):
    module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
    if module.padding_idx is not None:
        module.weight.data[module.padding_idx].zero_()

Are you referring to this? I don’t see the tying there. Or do you mean something else? Since module.padding_idx is 0 (and therefore not None), the padding embedding should be zeroed.

Weight tying means the classifier weights reference the embedding weights, making them the exact same weight tensor: if one changes, the other changes as well, because they literally share the same memory. This is done both to save parameters and because research shows it improves model performance (see Weight Tying Explained | Papers With Code).
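
A tiny standalone sketch of what that means in PyTorch terms (nothing BERT-specific):

import torch

emb = torch.nn.Embedding(5, 3, padding_idx=0)  # "input" embeddings
dec = torch.nn.Linear(3, 5, bias=False)        # "output" projection over the same vocab

dec.weight = emb.weight                        # tie: both modules now hold the same Parameter

dec.weight.data[0] += 1.0                      # modify it through the decoder...
print(emb.weight[0])                           # ...and the embedding's padding row changed too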

Regarding the code, let’s take a walk through it:

In modeling_bert.py we see the following:

class BertForPreTraining(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        self.bert = BertModel(config)
        self.cls = BertPreTrainingHeads(config)

        # Initialize weights and apply final processing
        self.post_init()

BertForPreTraining inherits from BertPreTrainedModel, which in turn inherits from PreTrainedModel in modeling_utils.py. That is where post_init() is defined, and it looks like this:

def post_init(self):
    """
    A method executed at the end of each Transformer model initialization, to execute code that needs the model's
    modules properly initialized (such as weight initialization).
    """
    self.init_weights()
    self._backward_compatibility_gradient_checkpointing()

Now, let’s look at self.init_weights():

def init_weights(self):
    """
    If needed prunes and maybe initializes weights.
    """
    # Prune heads if needed
    if self.config.pruned_heads:
        self.prune_heads(self.config.pruned_heads)

    if _init_weights:
        # Initialize weights
        self.apply(self._init_weights)

        # Tie weights should be skipped when not initializing all weights
        # since from_pretrained(...) calls tie weights anyways
        self.tie_weights()

And finally, self.tie_weights() eventually does output_embeddings.weight = input_embeddings.weight.
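
To see why that matters for the padding row: the tied matrix is also the output projection, which produces a logit for every vocabulary id (padding included) at every position, so it receives gradients on the output side even though the input-side lookup of padding_idx is excluded. A toy sketch (same idea as the tied pair above, nothing BERT-specific):

import torch

emb = torch.nn.Embedding(5, 3, padding_idx=0)
dec = torch.nn.Linear(3, 5, bias=False)
dec.weight = emb.weight                                # tied, like tie_weights() does

inp = torch.tensor([1, 2, 3])                          # the padding id never appears in the input
logits = dec(emb(inp))                                 # (3, 5): a score for every vocab id, incl. 0
loss = torch.nn.CrossEntropyLoss()(logits, torch.tensor([2, 3, 4]))
loss.backward()

print(emb.weight.grad[0])                              # non-zero: the padding row still gets a gradient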


OK, so the input and output embeddings are tied during pre-training, and because the tied matrix is also used to produce the output logits (which include a score for the padding token at every position), backprop ends up adjusting the padding row as well.

Thanks for your time, Ben.
