Short text clustering

Hey folks, I’ve been using the sentence-transformers library to try to group short texts together.

I’ve had reasonable success using the AgglomerativeClustering class from sklearn (using either euclidean distance + ward linkage, or precomputed cosine distances + average linkage), as its ability to set a distance threshold and automatically find the right number of clusters (as opposed to KMeans) is really nice.
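For reference, this is roughly the setup I mean (a minimal sketch - the thresholds are placeholder values that need tuning, embeddings is the array of sentence-transformers vectors, and the keyword is metric on scikit-learn >= 1.2, affinity on older versions):

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

# option 1: euclidean distance + ward linkage on the raw embeddings
labels_ward = AgglomerativeClustering(
    n_clusters=None,           # let the threshold pick the number of clusters
    distance_threshold=1.5,    # placeholder value, needs tuning
    linkage='ward').fit_predict(embeddings)

# option 2: precomputed cosine distances + average linkage
dist = cosine_distances(embeddings)    # 1 - cosine similarity, values in [0, 2]
labels_avg = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.3,    # placeholder value, needs tuning
    metric='precomputed',      # 'affinity' on scikit-learn < 1.2
    linkage='average').fit_predict(dist)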

But while it seemingly provides great results during the first wave of clustering, it tends to struggle to find decent groupings for the outliers that slip through the net the first time (where there are only 1-2 observations per group).

I’ve tried some other clustering methods such as:

  • KMedoids
  • HDBSCAN (not great)
  • KMeans / Agglomerative with a predefined K

But none have been as effective as hierarchical clustering on the initial embeddings. I’ve experimented with looping through and re-clustering the outliers with a slightly tighter distance threshold each time, but I’m not really sure of a way to set the thresholds automatically without a large amount of trial and error.
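Roughly, the loop looks something like this (a simplified sketch - the threshold, shrink factor, and minimum cluster size here are placeholders, not tuned values):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def recluster_outliers(embeddings, threshold=1.5, min_size=3,
                       shrink=0.9, max_rounds=5):
    embeddings = np.asarray(embeddings)
    labels = np.full(len(embeddings), -1)   # -1 = never absorbed into a cluster
    next_label = 0
    remaining = np.arange(len(embeddings))
    for _ in range(max_rounds):
        if len(remaining) < 2:
            break
        sub = AgglomerativeClustering(
            n_clusters=None, distance_threshold=threshold,
            linkage='ward').fit_predict(embeddings[remaining])
        leftovers = []
        for c in np.unique(sub):
            members = remaining[sub == c]
            if len(members) >= min_size:     # accept well-populated clusters
                labels[members] = next_label
                next_label += 1
            else:                            # pool stragglers for the next round
                leftovers.extend(members)
        remaining = np.array(leftovers, dtype=int)
        threshold *= shrink                  # tighten the threshold each pass
    return labels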

So I was wondering if anyone knew of any methods for:

a) Grouping these more effectively during the first wave
b) Better clustering together any of the remaining outliers after the first pass

Any help would be hugely appreciated - cheers!


hey @scroobiustrip, have you tried first passing the embeddings through UMAP before applying a density-based clustering algorithm? there’s a nice discussion of this approach in the UMAP docs, which comes with the following warning:

This is somewhat controversial, and should be attempted with care. For a good discussion of some of the issues involved in this, please see the various answers in this stackoverflow thread on clustering the results of t-SNE. Many of the points of concern raised there are salient for clustering the results of UMAP. The most notable is that UMAP, like t-SNE, does not completely preserve density. UMAP, like t-SNE, can also create false tears in clusters, resulting in a finer clustering than is necessarily present in the data.

the example in the docs actually seems to exhibit some of the problems you found with e.g. HDBSCAN in your application 🙂


Cheers @lewtun, that’s ace - I had attempted this before but didn’t really have much luck finding the right parameters. I’m currently attempting it with the following settings:

import umap                    # pip install umap-learn
from hdbscan import HDBSCAN    # pip install hdbscan

# reduce the embeddings to 1-D using a cosine metric
umap_data = umap.UMAP(n_neighbors=5, n_components=1, spread=0.5,
                      min_dist=0.0, metric='cosine').fit_transform(embeddings)

# density-based clustering on the reduced data
hdb = HDBSCAN(min_cluster_size=3,
              min_samples=5,
              metric='euclidean',
              cluster_selection_method='eom').fit(umap_data)

But it’s leaving me with one very large cluster filled with outliers, while the rest are grouped into fairly decent clusters in terms of quality - so I just need to find a way of breaking this large blob down into more sensible sub-groupings, I think. Thanks again for your help!
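One idea I’m toying with is pulling those points back out and re-clustering them on the original embeddings (rather than the 1-D UMAP output) with the agglomerative setup from my first post - something like this untested sketch, where the blob label and threshold are placeholders:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

labels = hdb.labels_.copy()
blob_label = -1                     # or whichever label the big blob ends up with
blob_idx = np.where(labels == blob_label)[0]

if len(blob_idx) > 1:
    # re-cluster the blob on the original embeddings, not the 1-D UMAP output
    sub = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=1.0,     # placeholder, needs tuning
        linkage='ward').fit_predict(embeddings[blob_idx])
    labels[blob_idx] = sub + labels.max() + 1   # offset to avoid label clashes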


perhaps you’ve already tried this, but does reducing n_neighbors help break apart the large cluster? it also seems that you’re projecting down to 1 dimension with n_components=1, so maybe you’d get better separation in a higher-dimensional space (harder to visualise of course 🙂)

another idea could be to combine two UMAP models (docs), one to target the large cluster, the other to tackle the remaining clusters.
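as a loose sketch of that second idea (not the exact model-combining API from the UMAP docs - just fitting a separate, higher-dimensional UMAP on the blob’s points, with all values illustrative and embeddings / hdb referring to your variables above):

import umap
import numpy as np
from hdbscan import HDBSCAN

blob_idx = np.where(hdb.labels_ == -1)[0]   # or the label of the big cluster
blob_proj = umap.UMAP(n_neighbors=5,
                      n_components=5,       # higher-dimensional, per the note above
                      min_dist=0.0,
                      metric='cosine').fit_transform(embeddings[blob_idx])
blob_labels = HDBSCAN(min_cluster_size=3).fit_predict(blob_proj)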
