I’m trying to cluster OpenAI embeddings (Ada) using Fast Clustering, but can’t make it work.
I embedded only 9 paragraphs by doing:
features_tensor = torch.tensor(np.vstack(df.embedding.values))
The resulting shape is pretty wide: torch.Size([9, 1536])
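One cheap thing to rule out first: if the embeddings come back as Python floats, np.vstack produces a float64 array, and the resulting double-precision tensor can be noticeably slower on an older CPU. A minimal sketch of casting to float32 before clustering (the embedding list here is random stand-in data, not your DataFrame):

```python
import numpy as np
import torch

# Hypothetical stand-in for df.embedding.values: nine Ada-sized vectors.
rng = np.random.default_rng(0)
embedding_values = [rng.normal(size=1536) for _ in range(9)]

# np.vstack over Python-float lists yields float64; casting to float32
# halves memory and matches what most torch code paths expect on CPU.
features_tensor = torch.tensor(
    np.vstack(embedding_values).astype(np.float32)
)
print(features_tensor.shape)  # torch.Size([9, 1536])
print(features_tensor.dtype)  # torch.float32
```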
And I try to cluster by doing:
clusters = util.community_detection(features_tensor, min_community_size=2, threshold=0.5)
The code runs indefinitely.
The machine I’m running this on is below average… an Intel i5-3337U with 8 GB of RAM, but I don’t know whether it’s a hardware and/or code issue.
OpenAI’s Ada showed better results than Sentence Transformers’ encodings.
My question is: Is the above code the best approach to using Fast Clustering? Should I reduce the dimensionality of ‘features_tensor’ with UMAP or PCA?
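If you do try the PCA route, note that with only 9 rows you can keep at most 9 components. A minimal NumPy-only sketch of PCA via SVD (random stand-in data; in practice you would likely reach for sklearn.decomposition.PCA or UMAP instead):

```python
import numpy as np

def pca_reduce(x: np.ndarray, n_components: int) -> np.ndarray:
    """Project the rows of x onto the top principal components via SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(9, 1536)).astype(np.float32)
reduced = pca_reduce(embeddings, n_components=8)
print(reduced.shape)  # (9, 8)
```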
Any help is really appreciated!
I’m just glancing at what you’ve presented, but isn’t the output for torch.Size out of order? Isn’t it stating that you have nine rows and one thousand five hundred and thirty-six columns? I’m not familiar with how OpenAI has implemented their embeddings, but regarding your dimensionality question, check out “Dimensionality Reduction for Data Visualization: PCA vs TSNE vs UMAP vs LDA: Visualising a high-dimensional dataset in 2D and 3D using PCA, TSNE, UMAP and LDA” by Siva Sivarajah. Also, how are you tracking the training of the model?
Thanks for your reply!
I copied and pasted the output of torch.Size. And yes, it has 9 rows and 1536 columns.
I’m just looking for a quick way to cluster the embeddings.
If I encode the data using the all-mpnet-base-v2 model, I can cluster it fine. However, if I switch to Ada’s embeddings, I hit the problem.
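For sanity-checking just 9 vectors, it can also help to bypass the library and run a toy version of threshold-based community detection directly; if this returns instantly on your Ada embeddings, the hang points at the library call or dtype rather than the data. A minimal stand-in (this is not sentence-transformers’ actual algorithm, just the same idea) shown on synthetic data:

```python
import numpy as np

def simple_communities(emb: np.ndarray, threshold: float, min_size: int):
    """Toy stand-in for util.community_detection: greedily group rows
    whose cosine similarity to a seed row meets `threshold`."""
    # Normalize rows so plain dot products are cosine similarities.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    assigned, communities = set(), []
    for i in range(len(emb)):
        if i in assigned:
            continue
        members = [j for j in range(len(emb))
                   if j not in assigned and sims[i, j] >= threshold]
        if len(members) >= min_size:
            communities.append(members)
            assigned.update(members)
    return communities

# Synthetic data: three orthogonal seed directions, each repeated
# three times with small noise, so three clear communities exist.
rng = np.random.default_rng(1)
base = np.eye(3, 16)
data = np.vstack([b + 0.01 * rng.normal(size=(3, 16)) for b in base])
print(simple_communities(data, threshold=0.9, min_size=2))
```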