Are transformer-based encoders just "text embeddings"?

I have an intuitive understanding of how word embeddings such as word2vec work, and also of convnets. I'd like to transfer that learning (pun intended) to figure out how transformers work for NLP. I'm developing an intuition and I'd like to validate or discard it before it misleads me.

Reading the introductory part of the Hugging Face course, they explain that an encoder model such as BERT basically maps a given text (i.e. a bunch of tokens) into a "feature space". Then, depending on the task, you attach a head that maps these features into something more useful, such as a text classification output. Is that all there is to it? Different models differ in their training task (i.e. their head) and in some structural details, hyperparameters and so on, but overall, transformers work as a sort of "text embedding", right?

Actually, ordinary word embeddings are learned in just that way: you put an embedding layer, then a dense layer to predict classes, and after training on a classification task over a large, general dataset you get, as a side effect, the learned embeddings, which you can then transfer by plugging a different layer after the embedding layer. Of course, a paramount difference between simple word embeddings and transformers is the attention mechanism, positional encodings and all that, I get it. But from a general point of view, the point is that given some input, via tokenization, you get a vector which encodes "the meaning" of the input in the feature space (or latent space).

By the way, I like this way of thinking because it is completely parallel to how convnets work. You put some convolutional layers to pack the 2D information of images into a higher-dimensional space, and then you put a dense network on top to predict labels (that is, a head), so by learning how to assign labels you are learning an encoder that is able to understand images.
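To make my intuition concrete, here is a minimal sketch of how I picture the "encoder + head" idea, using the Hugging Face transformers library. The pooling choice ([CLS] vector) and the 3-way linear head are just illustrative assumptions I made up, not any particular model's actual head:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Encoder: maps tokens into the "feature space"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Are transformers just text embeddings?", return_tensors="pt")
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Pool to one vector per text, e.g. take the [CLS] token representation
cls_vector = features[:, 0, :]  # (batch, hidden_size)

# Task-specific "head": a linear layer I attach for, say, 3-way classification
head = torch.nn.Linear(encoder.config.hidden_size, 3)
logits = head(cls_vector)
print(logits.shape)  # torch.Size([1, 3])
```

In my mental model, swapping the task just means swapping (and fine-tuning) that last linear layer, exactly like re-using a pre-trained embedding layer or convolutional backbone with a new head.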

Is this intuition right? Are transformers just generalised "text embeddings"? I feel I must be missing something important here, because if this were the case, I don't understand why I see no one drawing the analogy with word embeddings and convnets.

Thanks a lot in advance!