Hi! I would like to cluster articles about the same topic. I saw that Sentence-BERT might be a good place to start for embedding sentences and then checking similarity with something like cosine similarity. But since articles are built from many sentences, this method doesn't work well. Is there some BERT embedding that embeds a whole text, or maybe some algorithm for applying the sentence embeddings at the scale of a whole text?
Hi @cezary, since you want to cluster articles you could use any of the “encoder” Transformers (e.g. BERT-base) to extract the hidden states per article (see e.g. here for an example with IMDB) and then apply clustering / dimensionality reduction on the hidden states to identify the clusters.
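Something like this minimal sketch would get you per-article vectors (the model name and mean pooling are just illustrative choices, not the only way to do it):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # BERT-base handles at most 512 tokens, so long articles get truncated
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden state over tokens, ignoring padding
    mask = inputs["attention_mask"].unsqueeze(-1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

embeddings = embed(["First article text...", "Second article text..."])
```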
If you’re dealing with an unsupervised task, I’ve found that UMAP + HDBSCAN works well for the second step.
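A rough sketch of that second step, assuming the umap-learn and hdbscan packages (the hyperparameters here are just starting points):

```python
import hdbscan
import numpy as np
import umap

# Stand-in for the per-article hidden states from the step above
X = np.random.rand(200, 768)

# UMAP: reduce to a handful of dimensions before density-based clustering
reduced = umap.UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(X)

# HDBSCAN finds clusters of varying density; label -1 marks noise
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)
```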
If you go down the BERT route, I would recommend using a BERT fine-tuned on a task similar to yours and extracting the hidden states as suggested by @lewtun. An example of extracting the hidden states is here: Domain-Specific BERT Models · Chris McCormick
Hi! Thanks! The reason I was looking into Sentence-BERT was that it is designed specifically for similarity-related tasks, if I understood correctly. Why would I use vanilla BERT for that, then? Thanks!
Yeah, the problem with Sentence-BERT is that it doesn't seem to work well when I treat the whole text as a single sentence, or maybe you mean something else. I did think about summarization first, though; maybe I will use that.
Hi @cezary, in addition to the suggestions from @marcoabrate, a crude approach would be to take the average of the sentence-level embeddings from Sentence-BERT. This would give you a document-level representation which, although crude, should serve as a strong baseline for your task.
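For example, a minimal sketch (the NLTK sentence splitter and the model name are just illustrative choices):

```python
import numpy as np
from nltk.tokenize import sent_tokenize  # needs nltk.download("punkt") once
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-distilroberta-base-v1")

def embed_document(text):
    # Split the article into sentences, embed each, then average
    sentences = sent_tokenize(text)
    sentence_embeddings = model.encode(sentences)  # (n_sentences, dim)
    return np.mean(sentence_embeddings, axis=0)    # crude document vector
```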
Hello @FL33TW00D, I am using paraphrase-distilroberta-base-v1 to find important paragraphs in a book, starting from a summary. My books are not very technical, though, which is why I did not explore any topic-specific model.
I compared this sentence-transformers method with word2vec, doc2vec, and a custom method based on ROUGE. Sentence-transformers was the most accurate in my experiments.
I believe using BioBERT/SciBERT with mean pooling would give similar results. However, the model I am using is already fine-tuned for sentence comparison.
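Roughly, my setup looks like this (a sketch with placeholder texts; the variable names are mine, and `util.cos_sim` does the scoring):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-distilroberta-base-v1")

summary = "A short summary of the chapter..."
paragraphs = ["First paragraph of the book...", "Second paragraph..."]

summary_emb = model.encode(summary, convert_to_tensor=True)
paragraph_embs = model.encode(paragraphs, convert_to_tensor=True)

# Cosine similarity between the summary and every paragraph: shape (1, n)
scores = util.cos_sim(summary_emb, paragraph_embs)[0]
# Paragraphs most similar to the summary come first
ranked = sorted(zip(scores.tolist(), paragraphs), reverse=True)
```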
Hi @marcoabrate ,
Thanks for the info! For my task (clustering Wikipedia entries for a certain class of medications), I found the standard BERT vocab quite lacking, so I tried SciBERT with mean pooling, and the results didn't quite stand up to TF-IDF (there's a sketch of that baseline below). This is understandable though, since it's mean pooling to get the sentence vectors and then mean pooling again to get the paragraph vectors.
Glad you managed to get it to work though! I'll keep it in mind for future projects.
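For reference, the TF-IDF baseline I mentioned was along these lines (a sketch; the vectorizer settings are just examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Wikipedia entry for drug A...", "Wikipedia entry for drug B..."]

# Sparse (n_docs, n_terms) TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(documents)
# `tfidf` can go into the same UMAP/HDBSCAN pipeline as the dense embeddings
```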
Thanks for posing this question and the solutions. I just wonder: does using sentences vs. a whole document make a big difference in topic clustering with BERT?
I've used this technique to cluster short documents (emails, form responses, etc.) but am not aware of a GitHub example, unfortunately. It's not too complicated though, so you should be able to cook something up by looking at the UMAP / HDBSCAN docs.
Thanks for reading that post. I combined UMAP and HDBSCAN to finish the news article clustering task. The UMAP package is umap-learn and the HDBSCAN implementation is sklearn's. The docs are here: https://umap-learn.readthedocs.io/en/latest/basic_usage.html