BERT output for padding tokens

Hi,

I just saw that I still get embeddings for the padding tokens in my sentence. I assumed that the BERT output for a padding token would be a 768-dimensional zero vector.

So if I feed sentences padded to a max length of 20 into TFBertModel, I get a non-zero embedding for each of the 20 tokens, even if the sentence itself is only 10 tokens long and the rest is padding.

How can I get BERT output that ignores the padded positions?

I thought that was the point of feeding the attention mask, so that the padding is ignored?

Maybe I am misunderstanding something?

If I feed the sequence of hidden states from the BERT output into a GlobalAveragePooling layer, do I need the mask tensor to avoid averaging over the padding embeddings?
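For context, here is a minimal sketch of what I am seeing (the checkpoint, the sentence and the max length of 20 are just placeholders):

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

# a short sentence, padded up to 20 tokens
enc = tokenizer("A short sentence.", padding="max_length", max_length=20, return_tensors="tf")
out = model(enc)

hidden = out[0]                                        # shape (1, 20, 768)
pad_positions = tf.equal(enc["attention_mask"][0], 0)  # True where the input is padding
pad_embeddings = tf.boolean_mask(hidden[0], pad_positions)

# the hidden states at the padded positions are not zero vectors
print(tf.reduce_max(tf.abs(pad_embeddings)).numpy())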


Hey,

I think I should use the attention_mask from the tokenizer in GlobalAveragePooling1D()(x, mask=mask), right?

Otherwise the average is computed over all tokens, padding tokens included, right?

Since the GlobalAveragePooling layer is used inside the model that gets trained, and tf.keras normally propagates masks between layers, I would expect the mask to be handled automatically.

So can I assume that the mask is propagated from the BERT model to this pooling layer?

Any ideas? And how could I check this?

@datistiquo: You are correct that you should use the “attention_mask” from the tokenizer in:

GlobalAveragePooling1D()(x, mask=mask).

To answer your other question: no, the mask won’t be propagated from the output of BERT (“TFBertMainLayer”). Looking at the source code of TFBertMainLayer, it doesn’t support masking; according to the Keras documentation here, a layer would have to set “self.supports_masking = True” in its “__init__” method for that.
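To answer “how could I check this?”: here is a sketch of how I would check it, assuming the usual Keras mask plumbing (a supports_masking flag on every layer, and a private _keras_mask attribute attached to symbolic outputs when a mask is propagated); the details may differ between versions of transformers and TensorFlow:

import tensorflow as tf
from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-uncased")

# every Keras layer carries a supports_masking flag; the main layer is model.bert
print(model.bert.supports_masking)  # expected to be False here

# in the functional API, a propagated mask shows up as _keras_mask on the symbolic output
input_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(None,), dtype=tf.int32, name="attention_mask")
hidden_states = model(input_ids, attention_mask=attention_mask)[0]
print(getattr(hidden_states, "_keras_mask", None))  # None -> no mask was propagated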

Thus you need to pass the mask explicitly. Here is an example:

from tensorflow.keras.layers import GlobalAveragePooling1D
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

# output[0] is the sequence of hidden states; pass the attention mask explicitly
# so that the padded positions are excluded from the average
x = GlobalAveragePooling1D()(output[0], mask=encoded_input["attention_mask"])
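If you want to convince yourself that the mask is actually used, a quick sanity check (just a sketch, reusing output and encoded_input from above) is to compute the masked average by hand and compare it to x:

import tensorflow as tf

mask = tf.cast(encoded_input["attention_mask"], tf.float32)         # (batch, seq_len)
summed = tf.reduce_sum(output[0] * mask[:, :, tf.newaxis], axis=1)  # sum over real tokens only
manual_mean = summed / tf.reduce_sum(mask, axis=1, keepdims=True)   # divide by number of real tokens
# manual_mean should match x up to float precision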

Hope it helps