Question about BERT padding when calculating a similarity matrix

Hi all, I'm new to BERT. When I use BERT to calculate the word embeddings for one sentence and I specify max_length with padding enabled, I get an embedding matrix of size [1, max_length, 768], like this:

from transformers import BertTokenizer, BertModel

tz = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

# The sentence to be encoded
sent = "deep learning!"

# Encode the sentence
encoded = tz.encode_plus(
    text=sent,  # the sentence to be encoded
    add_special_tokens=True,  # add [CLS] and [SEP]
    max_length=10,  # maximum length of a sentence
    padding="max_length",  # add [PAD]s up to max_length
    return_tensors="pt",  # return PyTorch tensors
)
# {'input_ids': tensor([[ 101, 1996, 3776,  106,  102,    0,    0,    0,    0,    0]]), 
# 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
# 'attention_mask': tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])}

output0 = model(**encoded)[0] # size: [1, 10, 768]
tensor([[[-0.1983,  0.2368, -0.0717,  ..., -0.2809,  0.0442,  0.0142],
         [ 0.2063, -0.7768, -0.0193,  ..., -0.4573,  0.1309, -0.1332],
         [-0.3191, -0.9506,  0.1103,  ..., -0.2109,  0.1079, -0.2463],
         ...,
         [-0.1037, -0.0860,  0.3230,  ..., -0.1607,  0.0107, -0.0919],
         [-0.3363, -0.2425,  0.2199,  ..., -0.0281,  0.0726, -0.0862],
         [-0.2224, -0.1177,  0.3508,  ..., -0.1924,  0.0387,  0.0186]]],
       grad_fn=<NativeLayerNormBackward0>)
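
(For reference, the attention_mask already marks which of the 10 rows are real tokens, so I can pick out the non-padding rows, e.g.:)

real_tokens = encoded["attention_mask"].bool()  # [1, 10]: True for real tokens, False for [PAD]
non_pad = output0[real_tokens]                  # [5, 768]: embeddings of the non-padding positions only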

However, if I turn padding off (so the sentence is not padded up to max_length), I get the following embedding:

encoded = tz.encode_plus(
    text=sent,  # the sentence to be encoded
    add_special_tokens=True,  # add [CLS] and [SEP]
    max_length=64,  # maximum length of a sentence
    padding=False,  # do not pad
    return_attention_mask=True,  # generate the attention mask
    return_tensors="pt",  # return PyTorch tensors
)
# {'input_ids': tensor([[ 101, 1996, 3776,  106,  102]]), 
# 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 
# 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

output1 = model(**encoded)[0] # size: [1, 5, 768]
tensor([[[-0.1983,  0.2368, -0.0717,  ..., -0.2809,  0.0442,  0.0142],
         [ 0.2063, -0.7768, -0.0193,  ..., -0.4573,  0.1309, -0.1332],
         [-0.3191, -0.9506,  0.1103,  ..., -0.2109,  0.1079, -0.2463],
         [-0.2446, -0.1497, -0.2388,  ...,  0.3796,  0.3198, -0.1393],
         [ 0.7812,  0.2023, -0.2524,  ...,  0.1389, -0.7544, -0.2721]]],
       grad_fn=<NativeLayerNormBackward0>)
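
Interestingly, the rows for the real tokens look identical in both outputs. A quick check confirms it (just a sketch, continuing from the code above; it assumes the model is in eval mode so the outputs are deterministic):

import torch

# The attention mask should keep the [PAD] positions from influencing the real tokens,
# so the first five rows of output0 and output1 should agree up to small numerical noise.
print(torch.allclose(output0[0, :5], output1[0, :5], atol=1e-4))  # expected: True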

I have two questions about it:

  1. Why do the padding positions also get word embeddings? Are those padding embeddings meaningful?
  2. If I want to calculate a similarity matrix between two sentences' word embeddings, should I mask the padding embeddings to 0 before calculating the similarity matrix? Or how should I calculate a similarity matrix for two sentences that contain padding embeddings?

Hi, wyuancs!

There is no padding in your second example, so none of the embeddings you have shown there correspond to padding positions. I think you may be mistaking the vectors for the [CLS] and [SEP] tokens for some sort of padding, but that isn't the case: the BERT tokeniser adds these special tokens, and you specifically requested them (and commented on them) in the tokeniser call (add_special_tokens=True).

Regarding your second question, "similarity matrix" isn't really a standard term, and because it is underspecified, different applications might warrant different preprocessing steps, so you would need to provide a bit more information about what you are trying to do. The vector representation of the [CLS] token is usually taken to be a good embedding of the whole sentence, i.e. a combination of all its sub-parts, and you might find that a good way to calculate similarity. You definitely don't want to include the embeddings of padding positions in that calculation; it's also important to point out again that there is absolutely no padding in your second example. There are also situations where I wouldn't look at the [CLS] / [SEP] embeddings for sentence similarity. Without knowing more about what you want to look at, it's hard to give useful advice on exactly what needs to happen.
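
For instance, if a single sentence-level similarity score is what you're after, something along these lines would work (just a sketch; sent_a and sent_b are placeholder sentences, and I'm reloading the tokeniser/model here for completeness):

import torch
from transformers import BertTokenizer, BertModel

tz = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

def cls_embedding(sentence):
    encoded = tz(sentence, return_tensors="pt")  # no padding needed for a single sentence
    with torch.no_grad():
        hidden = model(**encoded)[0]             # [1, seq_len, 768]
    return hidden[0, 0]                          # the [CLS] vector, shape [768]

sent_a = "deep learning!"
sent_b = "machine learning is fun"
score = torch.nn.functional.cosine_similarity(cls_embedding(sent_a), cls_embedding(sent_b), dim=0)
print(score.item())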

Hi alxmrphi,

Thank you so much for your response. The first example does contain padding, though. Say I have two sentences tokenised with the same max_length, and they both have [CLS], [SEP], and padding. I want to calculate the similarity matrix for those two sentences. What should I do?
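
Concretely, would something like this be the right way to do it? (Just a rough sketch of what I have in mind; sent_a, sent_b, and max_length=10 are only placeholders.)

import torch
from transformers import BertTokenizer, BertModel

tz = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

def embed(sentence, max_length=10):
    encoded = tz(sentence, max_length=max_length, padding="max_length",
                 truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded)[0]                         # [1, max_length, 768]
    mask = encoded["attention_mask"].unsqueeze(-1)           # [1, max_length, 1]
    return hidden * mask.to(hidden.dtype)                    # zero out the [PAD] rows

sent_a = "deep learning!"
sent_b = "machine learning is fun"
emb_a = embed(sent_a)[0]                                     # [max_length, 768]
emb_b = embed(sent_b)[0]

# Entry [i, j] is the cosine similarity between token i of sentence A and token j
# of sentence B; rows/columns belonging to [PAD] tokens stay 0.
a = torch.nn.functional.normalize(emb_a, dim=-1)
b = torch.nn.functional.normalize(emb_b, dim=-1)
sim_matrix = a @ b.T                                         # [max_length, max_length]
print(sim_matrix.shape)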