I am asking my question here since I couldn’t find an answer to it anywhere.
I am a junior NLP engineer and I am experiencing some trouble with the BERT model, and more specifically with the embeddings it returns.
I am concerned about the fact that the embeddings returned for the PAD tokens are not the same from one sentence to another.
I have seen some forums where it is explained that this is because the PAD embeddings directly depend on the positional encoding, which I agree with. Nevertheless, take two simple sentences like “hello, I am a boy” and “hello, I am a girl” in a batch with other, longer sentences: both would be padded, and their pads would have the exact same positional encodings. Yet the pad embeddings still differ in the end, even with the model in .eval() mode. It can’t be due to the attention layers because of the attention masks, it can’t be due to the randomness of dropout since I turned it off, and it can’t be due to the positional encoding because the pads sit at the exact same positions in the two sentences.
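To make the setup concrete, here is a toy sketch in plain Python of a single self-attention step. It is not BERT itself: the function name `masked_self_attention`, the 2-d vectors, and the identity Q/K/V projections are all made up for illustration, and it assumes the attention mask is applied as a large negative value added to the scores of padded key positions. It compares the output at the padded position for two sequences that differ in only one real token.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_self_attention(x, mask):
    """Toy single-head self-attention with identity Q/K/V projections.

    x    : list of token vectors (one list of floats per position)
    mask : 1 for real tokens, 0 for padding; applied to KEY positions only
           (assumption about how the mask is used, for illustration)
    """
    dim = len(x[0])
    out = []
    for q in x:  # every position issues a query, padded ones included
        scores = []
        for k, keep in zip(x, mask):
            s = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            scores.append(s if keep else s - 1e9)  # additive mask on padded keys
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, x)) for d in range(dim)])
    return out

# Hypothetical 2-d "embeddings"; the two sequences differ only in the second token.
PAD = [0.5, 0.5]  # same vector (token + position) in both sequences
seq_a = [[1.0, 0.0], [0.0, 1.0], PAD]
seq_b = [[1.0, 0.0], [2.0, -1.0], PAD]
mask = [1, 1, 0]

pad_a = masked_self_attention(seq_a, mask)[-1]
pad_b = masked_self_attention(seq_b, mask)[-1]
print("pad output, sequence a:", pad_a)
print("pad output, sequence b:", pad_b)
```

In this toy version the mask only stops other positions from attending to the pad; the padded position still issues its own query over the unmasked tokens, so `pad_a` and `pad_b` come out different even though the pad input vector is identical in both sequences.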
Would anybody have an answer to my concerns? I understand that I can just “ignore” the pad embeddings if I just want information about the word embeddings, but I still would like to understand.
Have a nice day,