BERT output for padding tokens


I just noticed that my sentence still contains embeddings for the padding tokens. I assumed that the BERT output for a padding token would be a 768-dimensional zero vector.

So if I feed sentences with a max length of 20 into TFBertModel, I get a non-zero embedding for each of the 20 tokens, even if a sentence is only 10 tokens long and the rest is padding.

How can I get BERT output that ignores the padded positions?

I thought that was why I feed in the attention masks, so that the padding is ignored?

Maybe I misunderstand something?
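My current understanding (happy to be corrected): the attention mask only stops the real tokens from attending to the padded positions; the padded positions themselves still pass through the network and end up with non-zero hidden states. If you want them zeroed out, you can apply the mask yourself. A minimal NumPy sketch with made-up shapes (batch 1, seq_len 5, hidden size 4 instead of 768):

```python
import numpy as np

# Made-up stand-ins: hidden_states plays the role of BERT's last_hidden_state,
# attention_mask is what the tokenizer would return (1 = real token, 0 = padding).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1, 5, 4))
attention_mask = np.array([[1, 1, 1, 0, 0]])  # last two positions are padding

# Broadcast the mask over the hidden dimension and zero out the padded positions
mask = attention_mask[..., np.newaxis].astype(hidden_states.dtype)  # shape (1, 5, 1)
masked_states = hidden_states * mask

print(masked_states[0, 3])  # a padded position is now a zero vector
```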

If I feed the sequence of hidden states output by BERT into a GlobalAveragePooling layer, do I need the masking tensor to avoid averaging over the padding embeddings?


I think I should use the attention_mask from the tokenizer, as in GlobalAveragePooling1D()(x, mask=mask), right?

Otherwise the average is taken over all tokens, padded tokens included, right?
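To convince myself, here is a hand-rolled comparison in NumPy (toy shapes again) of the plain average versus a masked average that sums only the real tokens and divides by their count:

```python
import numpy as np

# Toy stand-ins for BERT hidden states and the tokenizer's attention mask
rng = np.random.default_rng(1)
hidden_states = rng.normal(size=(1, 5, 4))
attention_mask = np.array([[1, 1, 1, 0, 0]])  # two padded positions

# Naive average: padded positions contribute to the mean
naive_mean = hidden_states.mean(axis=1)

# Masked average: sum over real tokens only, divide by the number of real tokens
mask = attention_mask[..., np.newaxis].astype(hidden_states.dtype)
masked_mean = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)

print(np.allclose(naive_mean, masked_mean))  # False: the padding skews the naive mean
```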

When you use GlobalAveragePooling inside your model and train it, masking is normally propagated in tf.keras.

So, can I assume that the mask is propagated from the BERT model to this pooling layer?

Any ideas? And how could I check this?
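One way I can think of to check it: build a toy model where a mask definitely exists (an Embedding layer with mask_zero=True attaches one), then compare the pooled output against a manual masked mean. A sketch, assuming TensorFlow 2.x; the padding id 0 and the shapes are made up for the test:

```python
import tensorflow as tf

# Embedding(mask_zero=True) attaches a Keras mask to its output; if
# GlobalAveragePooling1D receives it, the pooled result equals the mean
# over the real tokens only.
tokens = tf.constant([[1, 2, 3, 0, 0]])  # 0 plays the role of the padding id here
x = tf.keras.layers.Embedding(input_dim=10, output_dim=4, mask_zero=True)(tokens)
print(x._keras_mask)  # [[ True  True  True False False]] -> a mask was attached

pooled = tf.keras.layers.GlobalAveragePooling1D()(x)

# Manual masked mean over the three real tokens, for comparison
manual = tf.reduce_mean(x[:, :3, :], axis=1)
print(bool(tf.reduce_all(tf.abs(pooled - manual) < 1e-5)))  # True -> mask was used
```

As far as I know, the Hugging Face TF models do not attach a Keras mask to their output hidden states, so the safe route is to pass the tokenizer's attention_mask explicitly, e.g. GlobalAveragePooling1D()(x, mask=tf.cast(attention_mask, tf.bool)).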