Clustering news articles with Sentence-BERT

Hi! I would like to cluster articles about the same topic. I saw that Sentence-BERT might be a good place to start: embed sentences and then check similarity with something like cosine similarity. But since articles are built from many sentences, this method doesn't work well. Is there a BERT embedding that embeds a whole text, or some algorithm for using the sentence embeddings at the scale of a whole text?

Thanks for any input!

1 Like

Hi @cezary, since you want to cluster articles you could use any of the “encoder” Transformers (e.g. BERT-base) to extract the hidden states per article (see e.g. here for an example with IMDB) and then apply clustering / dimensionality reduction on the hidden states to identify the clusters.
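
For the extraction step, a minimal sketch (the `bert-base-uncased` checkpoint and mean pooling over the last hidden state are just illustrative choices):

```python
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "bert-base-uncased"  # any encoder checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

def embed(texts):
    """Mean-pool the last hidden state into one vector per text."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    mask = enc["attention_mask"].unsqueeze(-1)           # (batch, seq, 1)
    summed = (out.last_hidden_state * mask).sum(dim=1)   # ignore padding tokens
    return summed / mask.sum(dim=1)                      # (batch, hidden_size)

article_vectors = embed(["First article text ...", "Second article text ..."])
```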

If you’re dealing with an unsupervised task, I’ve found that UMAP + HDBSCAN works well for the second step.

HTH!

4 Likes

Hi @cezary,
The suggestions from @lewtun are great and worth investigating.

When I approached a problem like this, a couple of avenues I explored were:

  1. Gensim TFIDF is a good baseline.
  2. Summarization + Sentence Transformers: GitHub - UKPLab/sentence-transformers: Sentence Embeddings with BERT & XLNet
    2a) Topic Clustering from Sentence Transformers: Clustering — Sentence-Transformers documentation (see the sketch after this list)
  3. If you go down the BERT route, I would recommend using a BERT model fine-tuned on a task similar to yours and extracting the hidden states as suggested by @lewtun; an example of extracting the hidden states is here: Domain-Specific BERT Models · Chris McCormick
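
A minimal sketch of the sentence-transformers clustering route (the paraphrase-distilroberta-base-v1 checkpoint, the article strings, and k=2 are just placeholders):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Any sentence-transformers checkpoint can be swapped in here.
model = SentenceTransformer("paraphrase-distilroberta-base-v1")

articles = ["Article one ...", "Article two ...", "Article three ..."]

# One embedding per article (or per summary, if you summarize first).
embeddings = model.encode(articles, normalize_embeddings=True)

# k-means on the normalized embeddings; the number of clusters is illustrative.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)
```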

Best of luck,
Chris

4 Likes

Hi! Thanks! The reason I was looking into Sentence-BERT was that it is designed for similarity-related tasks, if I understood correctly. Why would I then use vanilla BERT for that? Thanks!

Yeah, the problem with Sentence-BERT is that it doesn't seem to work well if I treat the whole text as a single sentence, unless you mean something else. I thought about summarization first, though; maybe I will use that.

Hey @cezary!
I am currently using sentence-BERT for matching similar paragraphs in biology books and it works amazingly.

For your task, I suggest you take a look at this article and this paper.
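
A rough sketch of the kind of paragraph matching I mean (the checkpoint and the texts are just placeholders):

```python
from sentence_transformers import SentenceTransformer, util

# Any similarity-tuned sentence-transformers checkpoint works here.
model = SentenceTransformer("paraphrase-distilroberta-base-v1")

paragraphs = ["Paragraph about cell division ...", "Paragraph about photosynthesis ..."]
query = "How do cells divide?"

para_emb = model.encode(paragraphs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every paragraph.
scores = util.cos_sim(query_emb, para_emb)[0]
best = int(scores.argmax())
print(paragraphs[best], float(scores[best]))
```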

Hope it helps.

4 Likes

Hi @marcoabrate,
Are you using sentence BERT with one of their models designed for sentences or something like Bio/SciBERT with mean pooling?

Any info would be great.

Thanks,
Chris

Hi @cezary, in addition to the suggestions from @marcoabrate, a crude approach would be to take the average of the sentence-level embeddings from Sentence-BERT. This would give you a document-level representation which, although crude, should serve as a strong baseline for your task.
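
A rough sketch of that averaging idea (the naive period-based sentence splitting and the model name are just placeholders; a proper sentence tokenizer would be better in practice):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-distilroberta-base-v1")

def document_vector(article: str) -> np.ndarray:
    # Crude sentence split; swap in nltk/spacy for real use.
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    sentence_embeddings = model.encode(sentences)   # (n_sentences, dim)
    return sentence_embeddings.mean(axis=0)         # document-level vector

doc_vecs = np.stack([document_vector(a) for a in ["Some article ...", "Another article ..."]])
```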

1 Like

Hello @FL33TW00D, I am using paraphrase-distilroberta-base-v1 to find important paragraphs in a book, starting from a summary. My books are not very technical, though, which is why I did not explore any domain-specific model.

I compared this sentence-transformers method with word2vec, doc2vec, and a custom method that uses ROUGE. Sentence-transformers was the most accurate in my experiments.

I believe using bio/sciBERT with mean pooling would bring similar results. However, the model I am using is already fine-tuned for sentence comparison.

Hi @marcoabrate ,
Thanks for the info! For my task (clustering Wikipedia entries for a certain class of medications), I found the standard BERT vocabulary quite lacking, so I tried SciBERT with mean pooling, but the results didn't quite stand up to TF-IDF. This is understandable, though, since I mean-pool to get the sentence vectors and then mean-pool again to get paragraph vectors.

Glad you managed to get it to work though! I'll keep it in mind for future projects.

Thanks,
Chris

Thanks everyone! I will consider every idea!

1 Like

Thanks for proposing this question and these solutions. I just wonder: will using sentences vs. a whole document make a big difference in topic clustering with BERT?

Hi @lewtun, you mention that

If you’re dealing with an unsupervised task, I’ve found that UMAP + HDBSCAN works well for the second step.

Do you have some quick GitHub example of this? In which context have you tried this idea?

Thanks!

hey @olaffson,

i’ve used this technique to cluster short documents (emails, form responses, etc.) but am not aware of a github example, unfortunately. but it’s not too complicated, so you should be able to cook something up by looking at the umap / hdbscan docs :slight_smile:

1 Like

I actually did. I will be happy to share my code here as soon as I clean it up.

2 Likes

Thanks for reading that post. I combined UMAP and HDBSCAN to do the news-article clustering task. The UMAP package is umap-learn and the HDBSCAN implementation is from sklearn. The docs are linked below:
https://umap-learn.readthedocs.io/en/latest/basic_usage.html
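
A minimal sketch of that pipeline (random vectors stand in for the document embeddings, and all parameter values are illustrative):

```python
import numpy as np
import umap                           # pip install umap-learn
from sklearn.cluster import HDBSCAN   # needs scikit-learn >= 1.3; the standalone
                                      # hdbscan package is an alternative

# Stand-in document embeddings; in practice these come from sentence-transformers
# or BERT hidden states as discussed above.
doc_vecs = np.random.rand(200, 384)

# Reduce dimensionality before density-based clustering.
reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine",
                    random_state=42).fit_transform(doc_vecs)

labels = HDBSCAN(min_cluster_size=5).fit_predict(reduced)   # -1 marks noise
```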

Hope it helps!