It makes sense that attention mechanisms have no inherent sense of position unless it is encoded explicitly, and for sequence prediction this seems critical. But word2vec, via CBOW or skip-gram, is able to learn word embeddings without any explicit position encoding. So my question is: if we train a BERT model without position encodings on the masked LM task (which seems very similar to word2vec's objective), what, if anything, is BERT capable of learning? Would it be better than word2vec for creating word embeddings?
My intuition is that the transformer would still have a notion of context. It would still know that a word appears together with those other words, but it would lose the notion of order that position embeddings provide. It would also still produce embeddings that change depending on the other words in context. So it would still be richer than word2vec, which has only one embedding per word (learned as a blend of the contexts it appears in).
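To make the "context without order" intuition concrete, here is a minimal sketch (not BERT itself, just a single self-attention head with random stand-in weights and no positional encoding) showing that such a layer is permutation-equivariant: shuffling the input tokens just shuffles the outputs, yet each token's output still depends on which other words are present in the context.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative embedding dimension

# Random projections standing in for learned query/key/value weights.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def self_attention(X):
    """One self-attention head with NO positional encoding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax over attention scores.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

X = rng.standard_normal((5, d))  # 5 "token" embeddings
perm = rng.permutation(5)

out = self_attention(X)
out_perm = self_attention(X[perm])

# 1) No sense of order: permuting the tokens just permutes the outputs.
assert np.allclose(out[perm], out_perm)

# 2) Still contextual: replacing one word changes the others' outputs,
#    unlike word2vec's single static vector per word.
X2 = X.copy()
X2[0] = rng.standard_normal(d)
assert not np.allclose(self_attention(X)[1], self_attention(X2)[1])
```

In other words, without position embeddings the model effectively sees each sequence as a bag of words, but the embeddings it produces are still context-dependent, which is exactly the gap between it and word2vec.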