Hi! I would like to cluster articles about the same topic. I saw that Sentence-BERT might be a good place to start for embedding sentences and then checking similarity with something like cosine similarity. But since articles are built from many sentences, this method doesn't work well. Is there some BERT embedding that embeds a whole text, or maybe some algorithm for applying the sentence embeddings at the scale of a whole text?
Hi @cezary, since you want to cluster articles you could use any of the “encoder” Transformers (e.g. BERT-base) to extract the hidden states per article (see e.g. here for an example with IMDB) and then apply clustering / dimensionality reduction on the hidden states to identify the clusters.
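Something like this minimal sketch would get you per-article vectors (the model name and mean pooling are just illustrative choices, not the only way to do it):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # BERT-base handles at most 512 tokens, so long articles get truncated
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden state over tokens, ignoring padding
    mask = inputs["attention_mask"].unsqueeze(-1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

embeddings = embed(["First article text...", "Second article text..."])
```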
If you’re dealing with an unsupervised task, I’ve found that UMAP + HDBSCAN works well for the second step.
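A rough sketch of that second step, assuming the umap-learn and hdbscan packages (the hyperparameters here are just starting points):

```python
import hdbscan
import numpy as np
import umap

# Stand-in for the per-article hidden states from the step above
X = np.random.rand(200, 768)

# UMAP: reduce to a handful of dimensions before density-based clustering
reduced = umap.UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(X)

# HDBSCAN finds clusters of varying density; label -1 marks noise
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)
```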
If you go down the BERT route, I would recommend using a BERT fine-tuned on a task similar to yours and extracting the hidden states as suggested by @lewtun. An example of extracting the hidden states is here: Domain-Specific BERT Models · Chris McCormick
Hi! Thanks! The reason I was looking into Sentence-BERT was that it is designed specifically for similarity-related tasks, if I understood correctly. Why would I use vanilla BERT for that, then? Thanks!
Yeah, the problem with Sentence-BERT is that it doesn't seem to work well when I treat the whole text as a single sentence, or maybe you mean something else. I did think about summarization first, though; maybe I will use that.
Hi @cezary, in addition to the suggestions from @marcoabrate, a crude approach would be to take the average of the sentence-level embeddings from Sentence-BERT. This would give you a document-level representation which, although crude, should serve as a strong baseline for your task.
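For example, a minimal sketch (the NLTK sentence splitter and the model name are just illustrative choices):

```python
import numpy as np
from nltk.tokenize import sent_tokenize  # needs nltk.download("punkt") once
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-distilroberta-base-v1")

def embed_document(text):
    # Split the article into sentences, embed each, then average
    sentences = sent_tokenize(text)
    sentence_embeddings = model.encode(sentences)  # (n_sentences, dim)
    return np.mean(sentence_embeddings, axis=0)    # crude document vector
```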
Hello @FL33TW00D, I am using paraphrase-distilroberta-base-v1 to find important paragraphs in a book, starting from a summary. My books are not very technical, though, which is why I did not explore any topic-specific model.
I compared this sentence-transformers method with word2vec, doc2vec, and a custom method based on ROUGE. Sentence-transformers was the most accurate in my experiments.
I believe using BioBERT/SciBERT with mean pooling would give similar results. However, the model I am using is already fine-tuned for sentence comparison.
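Roughly, my setup looks like this (a sketch with placeholder texts; the variable names are mine, and `util.cos_sim` does the scoring):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-distilroberta-base-v1")

summary = "A short summary of the chapter..."
paragraphs = ["First paragraph of the book...", "Second paragraph..."]

summary_emb = model.encode(summary, convert_to_tensor=True)
paragraph_embs = model.encode(paragraphs, convert_to_tensor=True)

# Cosine similarity between the summary and every paragraph: shape (1, n)
scores = util.cos_sim(summary_emb, paragraph_embs)[0]
# Paragraphs most similar to the summary come first
ranked = sorted(zip(scores.tolist(), paragraphs), reverse=True)
```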
Hi @marcoabrate ,
Thanks for the info! For my task (clustering Wikipedia entries for a certain class of medications), I found the standard BERT vocab quite lacking, so I tried SciBERT with mean pooling, and the results didn't quite stand up to TF-IDF (there's a sketch of that baseline below). This is understandable though, since it's mean pooling to get the sentence vectors and then mean pooling again to get the paragraph vectors.
Glad you managed to get it to work though! I'll keep it in mind for future projects.
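For reference, the TF-IDF baseline I mentioned was along these lines (a sketch; the vectorizer settings are just examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Wikipedia entry for drug A...", "Wikipedia entry for drug B..."]

# Sparse (n_docs, n_terms) TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(documents)
# `tfidf` can go into the same UMAP/HDBSCAN pipeline as the dense embeddings
```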
Thanks for posing this question and the solutions. I just wonder: does using sentences vs. a whole document make a big difference in topic clustering with BERT?
I've used this technique to cluster short documents (emails, form responses, etc.) but am not aware of a GitHub example, unfortunately. It's not too complicated though, so you should be able to cook something up by looking at the UMAP / HDBSCAN docs.
Thanks for reading that post. I combined UMAP and HDBSCAN to finish the news article clustering task. The UMAP package is umap-learn and the HDBSCAN implementation is sklearn's. The docs are here: https://umap-learn.readthedocs.io/en/latest/basic_usage.html