Anyone have advice on best methods to cluster BERT-embedded documents?

I would like to use the feature extractor to get BERT embeddings for a corpus of documents and then cluster those documents (I am open to different algorithms/similarity metrics). However, I am assuming that the dimensionality of the embeddings might be a problem. Has anyone done clustering on embeddings before? If so, what kind of dimensionality reduction did you use (if any), and how did you do the clustering or compute the similarity metrics? Even if you haven't done this before, any ideas or pointers to papers/examples would be great!

Also, just to add: I am not trying to do any kind of search (i.e., I am not interested in finding which article is most similar to article x), which is what I mostly found when googling this problem. Although both tasks use similarity metrics, the goals are ultimately different, and I wanted to be clear on that. I just want to cluster the documents in order to group the articles and come up with labels for them.
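For context, the embedding step I have in mind looks roughly like this (a rough sketch, not tested at scale; the checkpoint name and mean pooling are just placeholder choices):

```python
# Sketch: extract document embeddings from a vanilla BERT checkpoint.
# "bert-base-uncased" and mean pooling over tokens are placeholder choices.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float() # (batch, seq_len, 1)
    # Mean-pool over non-padding tokens -> one 768-dim vector per document
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

So each document ends up as a 768-dimensional vector, which is why I am worried about dimensionality.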

Thank you for viewing this question!


Hello @afractalthought,

You can try a Sentence-Transformers model, which is much better suited to clustering than feature extraction from vanilla BERT or RoBERTa. These models are trained so that cosine similarity between sentence embeddings reflects semantic similarity, so semantically similar documents get higher similarity scores and the clustering should improve.
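For example, something along these lines (a minimal sketch; the model name, sample documents, and number of clusters are placeholder assumptions):

```python
# Minimal sketch: embed documents with a sentence-transformers model,
# then cluster them. Model name and n_clusters are placeholder choices.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = ["first document ...", "second document ...", "third document ..."]

model = SentenceTransformer("all-MiniLM-L6-v2")
# normalize_embeddings=True (recent sentence-transformers versions)
# returns unit-length vectors.
embeddings = model.encode(docs, normalize_embeddings=True)

# On unit-length vectors, Euclidean k-means behaves like cosine-based
# clustering, matching the cosine-similarity intuition above.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)  # cluster id per document
```

From there you can inspect each cluster's documents to come up with labels, as you described.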


try a Sentence-Transformers model, which is much better suited to clustering than feature extraction from vanilla BERT or RoBERTa.

Hello, but why would a sentence-transformer perform better than a vanilla pre-trained BERT or any other transformer?
