Short text clustering

Hey folks, I’ve been using the sentence-transformers library to try to group short texts together.

I’ve had reasonable success using the AgglomerativeClustering class from sklearn (using either euclidean distance + ward linkage, or precomputed cosine distances + average linkage), as its ability to set a distance threshold and automatically find the right number of clusters (as opposed to KMeans) is really nice.
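For reference, this is roughly the setup I mean (a minimal sketch - the thresholds are placeholder values that need tuning, embeddings is the array of sentence-transformers vectors, and the keyword is metric on scikit-learn >= 1.2, affinity on older versions):

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

# option 1: euclidean distance + ward linkage on the raw embeddings
labels_ward = AgglomerativeClustering(
    n_clusters=None,           # let the threshold pick the number of clusters
    distance_threshold=1.5,    # placeholder value, needs tuning
    linkage='ward').fit_predict(embeddings)

# option 2: precomputed cosine distances + average linkage
dist = cosine_distances(embeddings)    # 1 - cosine similarity, values in [0, 2]
labels_avg = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.3,    # placeholder value, needs tuning
    metric='precomputed',      # 'affinity' on scikit-learn < 1.2
    linkage='average').fit_predict(dist)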

But while it seemingly provides great results during the first wave of clustering, it tends to struggle to find decent groupings for the outliers that slip through the net the first time (where there are only 1-2 observations per group).

I’ve tried some other clustering methods such as:

  • KMedoids
  • HDBSCAN (not great)
  • KMeans / Agglomerative with a predefined K

But none have been as effective as hierarchical clustering on the initial embeddings. I’ve experimented with looping through and re-clustering the outliers with a slightly tighter distance threshold each time, but I’m not really sure of a way to set the thresholds automatically without a large amount of trial and error.
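Roughly, the loop looks something like this (a simplified sketch - the threshold, shrink factor, and minimum cluster size here are placeholders, not tuned values):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def recluster_outliers(embeddings, threshold=1.5, min_size=3,
                       shrink=0.9, max_rounds=5):
    embeddings = np.asarray(embeddings)
    labels = np.full(len(embeddings), -1)   # -1 = never absorbed into a cluster
    next_label = 0
    remaining = np.arange(len(embeddings))
    for _ in range(max_rounds):
        if len(remaining) < 2:
            break
        sub = AgglomerativeClustering(
            n_clusters=None, distance_threshold=threshold,
            linkage='ward').fit_predict(embeddings[remaining])
        leftovers = []
        for c in np.unique(sub):
            members = remaining[sub == c]
            if len(members) >= min_size:     # accept well-populated clusters
                labels[members] = next_label
                next_label += 1
            else:                            # pool stragglers for the next round
                leftovers.extend(members)
        remaining = np.array(leftovers, dtype=int)
        threshold *= shrink                  # tighten the threshold each pass
    return labels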

So I was wondering if anyone knew of any methods for:

a) Grouping these more effectively during the first wave
b) Better clustering together any of the remaining outliers after the first pass

Any help would be hugely appreciated - cheers!


hey @scroobiustrip, have you tried first passing the embeddings through UMAP before applying a density-based clustering algorithm? there’s a nice discussion of this approach in the UMAP docs, which comes with the following warning:

This is somewhat controversial, and should be attempted with care. For a good discussion of some of the issues involved in this, please see the various answers in this stackoverflow thread on clustering the results of t-SNE. Many of the points of concern raised there are salient for clustering the results of UMAP. The most notable is that UMAP, like t-SNE, does not completely preserve density. UMAP, like t-SNE, can also create false tears in clusters, resulting in a finer clustering than is necessarily present in the data.

the example in the docs actually seems to exhibit some of the problems you found with e.g. HDBSCAN in your application 🙂


Cheers @lewtun, that’s ace - I had attempted this before but didn’t really have much luck finding the right parameters. I’m currently attempting it with the following settings:

import umap                    # pip install umap-learn
from hdbscan import HDBSCAN    # pip install hdbscan

# reduce the embeddings to 1-D using a cosine metric
umap_data = umap.UMAP(n_neighbors=5, n_components=1, spread=0.5,
                      min_dist=0.0, metric='cosine').fit_transform(embeddings)

# density-based clustering on the reduced data
hdb = HDBSCAN(min_cluster_size=3,
              min_samples=5,
              metric='euclidean',
              cluster_selection_method='eom').fit(umap_data)

But it’s leaving me with one very large cluster filled with outliers, while the rest are grouped into fairly decent clusters in terms of quality - so I just need to find a way of breaking this large blob down into more sensible sub-groupings, I think. Thanks again for your help!
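One idea I’m toying with is pulling those points back out and re-clustering them on the original embeddings (rather than the 1-D UMAP output) with the agglomerative setup from my first post - something like this untested sketch, where the blob label and threshold are placeholders:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

labels = hdb.labels_.copy()
blob_label = -1                     # or whichever label the big blob ends up with
blob_idx = np.where(labels == blob_label)[0]

if len(blob_idx) > 1:
    # re-cluster the blob on the original embeddings, not the 1-D UMAP output
    sub = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=1.0,     # placeholder, needs tuning
        linkage='ward').fit_predict(embeddings[blob_idx])
    labels[blob_idx] = sub + labels.max() + 1   # offset to avoid label clashes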


perhaps you’ve already tried this, but does reducing n_neighbors help break apart the large cluster? it also seems that you’re projecting down to 1 dimension with n_components=1, so maybe you’d get better separation in a higher-dimensional space (harder to visualise of course 🙂)

another idea could be to combine two UMAP models (docs), one to target the large cluster, the other to tackle the remaining clusters.
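as a loose sketch of that second idea (not the exact model-combining API from the UMAP docs - just fitting a separate, higher-dimensional UMAP on the blob’s points, with all values illustrative and embeddings / hdb referring to your variables above):

import umap
import numpy as np
from hdbscan import HDBSCAN

blob_idx = np.where(hdb.labels_ == -1)[0]   # or the label of the big cluster
blob_proj = umap.UMAP(n_neighbors=5,
                      n_components=5,       # higher-dimensional, per the note above
                      min_dist=0.0,
                      metric='cosine').fit_transform(embeddings[blob_idx])
blob_labels = HDBSCAN(min_cluster_size=3).fit_predict(blob_proj)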
