What can transformers learn without position encoding?

thecity2 · June 3, 2021, 3:52pm

So it obviously makes sense that attention mechanisms have no inherent sense of position unless it is encoded explicitly, and for sequence prediction this seems critical. But word2vec, for example, learns word embeddings via CBOW or skip-gram without any explicit position encoding. So my question is basically: if we train a BERT model without position encodings on the masked LM task (which seems very similar to word2vec to me), what, if anything, is BERT capable of learning? Would it be better than word2vec for creating word embeddings?

mvonwyl · June 10, 2021, 8:18am

My intuition is that the transformer would still have a notion of context. It would still know that this word appears in context with those other words, but it would lose the notion of order loosely associated with position embeddings. It would also still allow a word's embedding to change depending on the other words in context. So it would still be better than word2vec, which has only one embedding per word (learned as a combination of several contexts).
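
To make that intuition concrete, here is a rough sketch of a single self-attention layer with no position encoding, written in plain NumPy. Everything in it is an assumption for illustration: the toy vocabulary, the random embeddings and weights, and the helper names `self_attention` and `encode` are made up, and this is not BERT itself. It only checks the two claims above: shuffling the words does not change what a given word's output looks like, while swapping a context word does.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no position encoding."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical toy setup: random (untrained) embeddings and projections.
rng = np.random.default_rng(0)
d = 8
vocab = {"the": 0, "river": 1, "money": 2, "bank": 3}
emb = rng.normal(size=(len(vocab), d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

def encode(words):
    """Look up token embeddings and run them through the attention layer."""
    x = emb[[vocab[w] for w in words]]
    return self_attention(x, w_q, w_k, w_v)

out_a = encode(["the", "river", "bank"])
out_b = encode(["bank", "the", "river"])  # same words, different order
out_c = encode(["the", "money", "bank"])  # different context word

# "bank" gets the same vector however the sentence is shuffled...
order_is_invisible = bool(np.allclose(out_a[2], out_b[0]))
# ...but a different vector when the surrounding words change.
context_matters = not np.allclose(out_a[2], out_c[2])

print(order_is_invisible, context_matters)
```

So without position information the layer effectively sees a bag of words, but the output for each word still depends on which other words are in that bag, which is something a single static word2vec vector cannot do.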
