After learning about the different types of Transformer architectures (thanks a lot for this amazingly informative course!), I have a question about the output of encoder-only models:
The video says that (at least for BERT-like models) the output contains an encoding vector for each of the input words.
It is also said that these outputs are well suited for sentence classification tasks.
I see how an attention-based classifier would be able to use the encoder output as input, attending to the most relevant words for the classification task.
But this kind of encoding doesn't seem well suited as input for a linear classifier, right? A linear classifier would be sensitive to the position of the individual words in the output vector. Or am I overlooking something?
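To make the concern concrete, here is a rough sketch of what I mean (assuming `bert-base-uncased` and the `transformers` library; the two-class linear head is just a hypothetical example):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I love this movie", return_tensors="pt")
outputs = model(**inputs)

# One vector per input token: shape (batch, seq_len, hidden_size)
token_vectors = outputs.last_hidden_state          # e.g. (1, 6, 768)

# A linear classifier over all token vectors would have to flatten them,
# so each weight ends up tied to a fixed token position:
flat = token_vectors.flatten(start_dim=1)          # (1, seq_len * 768)
linear_head = torch.nn.Linear(flat.shape[1], 2)    # hypothetical 2-class head
logits = linear_head(flat)

# If the same word appears one position earlier or later, it hits
# completely different weights, which is what I mean by "position-sensitive".
```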