My intuition would be that the transformer would still have a notion of context. It would still know that this word appears in context with those other words, but it would lose the notion of order that the position embeddings loosely encode. It would also still let a word's embedding change depending on the other words in context. So it would still be better than word2vec, which has only one embedding per word (learned as a blend of all the contexts it appears in). A small sketch of this intuition is below.
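
To make that concrete, here is a minimal sketch (NumPy, one attention head, random weights, all names hypothetical) of the idea that self-attention without position embeddings is permutation equivariant: each token's contextual embedding depends on *which* other tokens are in the sequence, but not on their order.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # embedding / head dimension
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    """Plain scaled dot-product self-attention, no position information."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                 # one contextual embedding per token

# Hypothetical word embeddings for a 4-token sentence
X = rng.normal(size=(4, d))
out = self_attention(X)

# Shuffle the token order: each token gets the same contextual embedding,
# just in shuffled positions, so context is kept but order is lost.
perm = np.array([2, 0, 3, 1])
out_perm = self_attention(X[perm])
print(np.allclose(out_perm, out[perm]))   # True
```

The contrast with word2vec is that `out` assigns each occurrence of a word a vector that depends on its neighbours in this particular sentence, whereas a word2vec lookup table would return the same single vector for a word no matter what sentence it appears in.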