I just noticed that my sentence output still contains embeddings for the padding tokens. I had assumed that BERT's output for a padding position would be a 768-dim zero vector.
So if I feed sentences with a max length of 20 into TFBertModel, I get a non-zero embedding for each of the 20 positions, even when a sentence is only 10 tokens long and the rest is padding.
How can I get BERT output that ignores the padded positions?
I thought that was the point of feeding in the attention mask, so that padding is ignored?
Maybe I am misunderstanding something?
If I feed the sequence of hidden states from the BERT output into a GlobalAveragePooling1D layer, do I need the masking tensor to avoid averaging in the padding embeddings?
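To make the pooling question concrete, here is a minimal sketch of masked mean pooling using plain numpy. The `hidden_states` array is a toy stand-in for BERT's `last_hidden_state` (the shapes and values are made up for illustration); the idea is simply to zero out padded positions with the attention mask and divide by the number of real tokens rather than the full sequence length:

```python
import numpy as np

# Toy stand-in for BERT's last_hidden_state: batch of 1, seq_len 4, dim 3.
# Real positions hold 1.0; the two padded positions hold 5.0, mimicking the
# fact that BERT still emits non-zero embeddings for padding tokens.
hidden_states = np.ones((1, 4, 3))
hidden_states[0, 2:] = 5.0

# Attention mask as returned by the tokenizer: 1 for real tokens, 0 for padding.
attention_mask = np.array([[1, 1, 0, 0]], dtype=np.float64)

# Masked mean pooling: zero out padded positions, then divide by the
# count of real tokens instead of the sequence length.
masked = hidden_states * attention_mask[..., None]
pooled = masked.sum(axis=1) / attention_mask.sum(axis=1, keepdims=True)

print(pooled)  # the padded 5.0 vectors do not leak into the average
```

A naive `hidden_states.mean(axis=1)` would be pulled toward the padding vectors, while the masked version averages only the real tokens. In Keras, `tf.keras.layers.GlobalAveragePooling1D` can, as far as I understand, do the same thing when you pass the boolean mask to its `mask` argument in the call.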