Are BERT and its derivatives (like DistilBERT, RoBERTa, …) document embedding methods like Doc2Vec?
Do you mean they map words to vectors? Yes, they do, but it works differently from methods like word2vec; I am not sure about Doc2Vec, though. In word2vec, each word gets exactly one vector, and that's it. This is not ideal because some words have different meanings in different contexts: there are banks where we deposit or withdraw money, and there are river banks. Word2vec gives both "bank"s the same vector, whereas BERT produces a vector that depends on the context.
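To make that concrete, here is a rough sketch using the Hugging Face transformers library (the bert-base-uncased checkpoint is just one example choice): the same word "bank" gets a noticeably different vector in the two sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Rough sketch, assuming the bert-base-uncased checkpoint:
# the vector BERT assigns to "bank" depends on the surrounding sentence,
# unlike a static word2vec embedding.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]  # hidden state of the "bank" token

v1 = bank_vector("i went to the bank to deposit money.")
v2 = bank_vector("we sat on the river bank.")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```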
True, Doc2Vec is like word2vec except that it also includes a document ID. So we can use BERT as both word2vec and Doc2Vec, right?
Such models output a representation for each token in the context of the other tokens to its left and right. You need to aggregate these representations somehow to obtain a single vector that represents a document. A common approach is to average the token vectors, for example. I'd suggest using Sentence Transformers for this purpose.
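Something like this, as a minimal sketch with the sentence-transformers library (the "all-MiniLM-L6-v2" checkpoint here is just an example choice): the pooling over token vectors happens inside the model, and you get one vector per document.

```python
from sentence_transformers import SentenceTransformer

# Minimal sketch: sentence-transformers pools the token representations
# internally and returns a single vector per input text.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

docs = [
    "I went to the bank to deposit money.",
    "We had a picnic on the river bank.",
]
embeddings = model.encode(docs)
print(embeddings.shape)  # (2, embedding_dim), e.g. (2, 384) for this model
```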
You mean that the 768 features we get in the BERT output cannot represent a document by themselves?
BERT's output is not just 768 features; it is 768 features for each token.
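To illustrate (again a rough sketch with Hugging Face transformers, assuming bert-base-uncased): a document of N tokens comes out as an (N, 768) matrix, which you could then, for instance, mean-pool into a single 768-dimensional vector.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("We had a picnic on the river bank.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_vectors = outputs.last_hidden_state  # shape (1, num_tokens, 768): one vector per token
doc_vector = token_vectors.mean(dim=1)     # naive mean pooling -> (1, 768)
print(token_vectors.shape, doc_vector.shape)
```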