BERT output for padding tokens


I just noticed that my sentence still contains embeddings for the padding tokens. I assumed that the BERT output for a padding token would be a 768-dimensional zero vector.

So if I feed sentences with a max length of 20 into TFBertModel, I get a non-zero embedding for each of the 20 tokens, even if a sentence is only 10 tokens long and the rest is padding.

How can I get BERT output that ignores the padded positions?

I thought that was why I feed in the attention masks, so that the padding is ignored?

Maybe I misunderstand something?
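My current understanding (happy to be corrected): the attention mask only stops the real tokens from attending to the padded positions; the padded positions themselves still pass through the network and end up with non-zero hidden states. If you want them zeroed out, you can apply the mask yourself. A minimal NumPy sketch with made-up shapes (batch 1, seq_len 5, hidden size 4 instead of 768):

```python
import numpy as np

# Made-up stand-ins: hidden_states plays the role of BERT's last_hidden_state,
# attention_mask is what the tokenizer would return (1 = real token, 0 = padding).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1, 5, 4))
attention_mask = np.array([[1, 1, 1, 0, 0]])  # last two positions are padding

# Broadcast the mask over the hidden dimension and zero out the padded positions
mask = attention_mask[..., np.newaxis].astype(hidden_states.dtype)  # shape (1, 5, 1)
masked_states = hidden_states * mask

print(masked_states[0, 3])  # a padded position is now a zero vector
```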

If I feed the sequence of hidden states output by BERT into a GlobalAveragePooling layer, do I need the masking tensor to avoid averaging over the padding embeddings?


I think I should use the attention_mask from the tokenizer, as in GlobalAveragePooling1D()(x, mask=mask), right?

Otherwise the average is taken over all tokens, padded tokens included, right?
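To convince myself, here is a hand-rolled comparison in NumPy (toy shapes again) of the plain average versus a masked average that sums only the real tokens and divides by their count:

```python
import numpy as np

# Toy stand-ins for BERT hidden states and the tokenizer's attention mask
rng = np.random.default_rng(1)
hidden_states = rng.normal(size=(1, 5, 4))
attention_mask = np.array([[1, 1, 1, 0, 0]])  # two padded positions

# Naive average: padded positions contribute to the mean
naive_mean = hidden_states.mean(axis=1)

# Masked average: sum over real tokens only, divide by the number of real tokens
mask = attention_mask[..., np.newaxis].astype(hidden_states.dtype)
masked_mean = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)

print(np.allclose(naive_mean, masked_mean))  # False: the padding skews the naive mean
```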

When you use GlobalAveragePooling inside your model and train it, masking is normally propagated in tf.keras.

So, can I assume that the mask is propagated from the BERT model to this pooling layer?

Any ideas? And how could I check this?
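One way I can think of to check it: build a toy model where a mask definitely exists (an Embedding layer with mask_zero=True attaches one), then compare the pooled output against a manual masked mean. A sketch, assuming TensorFlow 2.x; the padding id 0 and the shapes are made up for the test:

```python
import tensorflow as tf

# Embedding(mask_zero=True) attaches a Keras mask to its output; if
# GlobalAveragePooling1D receives it, the pooled result equals the mean
# over the real tokens only.
tokens = tf.constant([[1, 2, 3, 0, 0]])  # 0 plays the role of the padding id here
x = tf.keras.layers.Embedding(input_dim=10, output_dim=4, mask_zero=True)(tokens)
print(x._keras_mask)  # [[ True  True  True False False]] -> a mask was attached

pooled = tf.keras.layers.GlobalAveragePooling1D()(x)

# Manual masked mean over the three real tokens, for comparison
manual = tf.reduce_mean(x[:, :3, :], axis=1)
print(bool(tf.reduce_all(tf.abs(pooled - manual) < 1e-5)))  # True -> mask was used
```

As far as I know, the Hugging Face TF models do not attach a Keras mask to their output hidden states, so the safe route is to pass the tokenizer's attention_mask explicitly, e.g. GlobalAveragePooling1D()(x, mask=tf.cast(attention_mask, tf.bool)).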