Anyone have advice on best methods to cluster BERT-embedded documents?

I would like to use the feature extractor to get BERT embeddings for a corpus of documents and then cluster those documents (I am open to different algorithms and similarity metrics at this point). However, I assume the dimensionality of the embeddings might be a problem. Has anyone done clustering on embeddings before? If so, what kind of dimensionality reduction did you use (if any), and how did you do the clustering or compute similarity metrics? Even if you haven't done this before, any ideas or pointers to papers/examples would be great!

I also want to add that I am not trying to do any kind of search (i.e., I am not interested in finding out which article is most similar to article x), which is what I mostly found online when googling this problem. Although both tasks use similarity metrics, the goal is ultimately different, and I wanted to be clear on that. I just want to cluster the documents in order to group the articles and come up with labels for them.

Thank you for viewing this question!


Hello @afractalthought,

You can try Sentence Transformers (the sentence-transformers library), which is generally better suited to clustering on extracted features than vanilla BERT or RoBERTa. When you apply cosine similarity to the sentence embeddings from these models, semantically similar documents should get a higher similarity score, which should in turn improve the clustering.
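
To make the cosine-similarity idea concrete, here is a minimal sketch of clustering by cosine similarity (a simple spherical k-means). It is only an illustration under some assumptions: the tiny 3-dimensional vectors in `toy_embeddings` are made up and stand in for real sentence embeddings (which you would get from the model, for example with `SentenceTransformer.encode`), and the helper names `l2_normalize`, `dot`, and `spherical_kmeans` are hypothetical, not part of any library. In practice you would more likely hand the normalized embeddings to an existing clustering implementation.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(embeddings, k, iterations=20):
    """Cluster vectors by cosine similarity; returns one cluster label per vector."""
    points = [l2_normalize(v) for v in embeddings]

    # Deterministic farthest-first initialisation: start from the first point,
    # then keep adding the point least similar to the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = min(points, key=lambda p: max(dot(p, c) for c in centroids))
        centroids.append(farthest)

    labels = None
    for _ in range(iterations):
        # Assign each point to the centroid with the highest cosine similarity.
        new_labels = [max(range(k), key=lambda j: dot(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the normalized mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = l2_normalize(mean)
    return labels

# Made-up stand-ins for document embeddings (assumption: real ones come from the model).
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.1],
    [0.0, 0.8, 0.2],
]
labels = spherical_kmeans(toy_embeddings, k=2)
print(labels)
```

Normalizing the vectors first is the key design choice here: on unit-length vectors, ranking by dot product is the same as ranking by cosine similarity, so an ordinary k-means style loop ends up grouping documents by direction rather than by magnitude.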