Hey folks, I’ve been using the sentence-transformers library to try to group together short texts.
I’ve had reasonable success with sklearn’s AgglomerativeClustering (using either euclidean distance + ward linkage, or precomputed cosine distance + average linkage), as its ability to take a distance threshold and automatically find the number of clusters (as opposed to KMeans, where K has to be picked up front) is really nice.
But while it seems to give great results during the first wave of clustering, it tends to struggle to find decent groupings for the outliers that slip through the net the first time (the ones that end up with only 1-2 observations per group).
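As a rough sketch of that kind of first pass (illustrative only: the function name `first_pass`, the `min_size` cutoff and the default threshold are made up for the example, and it uses ward linkage on L2-normalised vectors so that euclidean distance tracks cosine distance):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize


def first_pass(embeddings, distance_threshold=1.0, min_size=3):
    """Cluster embeddings and flag members of tiny clusters as outliers.

    `distance_threshold` and `min_size` are placeholder values, not tuned ones.
    """
    # L2-normalise so that euclidean distance is a monotone function of cosine distance
    X = normalize(np.asarray(embeddings, dtype=float))
    model = AgglomerativeClustering(
        n_clusters=None,                        # let the threshold decide how many clusters
        distance_threshold=distance_threshold,
        linkage="ward",                         # ward linkage uses euclidean distance
    )
    labels = model.fit_predict(X)
    sizes = np.bincount(labels)                 # number of points in each cluster
    outlier_mask = sizes[labels] < min_size     # True for points in clusters below min_size
    return labels, outlier_mask
```

In practice `embeddings` would be the output of the sentence-transformers model; anything with `outlier_mask` set to True is what gets carried into the second pass.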
I’ve tried some other clustering methods such as:
- HDBSCAN (not great)
- KMeans / Agglomerative with a predefined K
But none have been as effective as hierarchical clustering on the initial embeddings. I’ve experimented with looping through and re-clustering just the outliers with a slightly tighter distance threshold each time, but I’m not really sure of a way to set the thresholds automatically without a large amount of trial and error.
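To make the looping idea concrete, here is a minimal sketch of one way it could be written. The helper name `recluster_outliers`, the `min_size` cutoff and the threshold schedule are all assumptions for illustration, and the thresholds still have to be supplied by hand:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def recluster_outliers(X, thresholds, min_size=2):
    """Re-cluster leftover points once per threshold.

    Clusters with at least `min_size` members are kept; everything else is
    passed on to the next threshold. Returns one label per row of X, with -1
    for points that never joined a kept cluster.
    """
    X = np.asarray(X, dtype=float)
    final = np.full(len(X), -1)
    remaining = np.arange(len(X))    # indices of points still unassigned
    next_label = 0
    for t in thresholds:
        if len(remaining) < 2:       # AgglomerativeClustering needs at least 2 samples
            break
        labels = AgglomerativeClustering(
            n_clusters=None, distance_threshold=t, linkage="ward"
        ).fit_predict(X[remaining])
        sizes = np.bincount(labels)
        keep = sizes[labels] >= min_size         # points whose cluster is big enough
        for lab in np.unique(labels[keep]):
            final[remaining[labels == lab]] = next_label
            next_label += 1
        remaining = remaining[~keep]             # carry the rest to the next round
    return final
```

Called as something like `recluster_outliers(outlier_embeddings, thresholds=[0.5, 0.45, 0.4])`, where `outlier_embeddings` and the three values are placeholders, it only automates the loop itself, not the choice of thresholds.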
So I was wondering if anyone knew of any methods for:
a) Grouping these more effectively during the first wave
b) Better clustering together any of the remaining outliers after the first pass
Any help would be hugely appreciated - cheers!