I’m trying to cluster OpenAI embeddings (Ada) using Fast Clustering, but can’t make it work.
I embedded only 9 paragraphs by doing:
features_tensor = torch.tensor(np.vstack(df.embedding.values))
The resulting shape is pretty wide: torch.Size([9, 1536])
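One cheap thing to rule out first: if the embeddings come back as Python floats, np.vstack produces a float64 array, and the resulting double-precision tensor can be noticeably slower on an older CPU. A minimal sketch of casting to float32 before clustering (the embedding list here is random stand-in data, not your DataFrame):

```python
import numpy as np
import torch

# Hypothetical stand-in for df.embedding.values: nine Ada-sized vectors.
rng = np.random.default_rng(0)
embedding_values = [rng.normal(size=1536) for _ in range(9)]

# np.vstack over Python-float lists yields float64; casting to float32
# halves memory and matches what most torch code paths expect on CPU.
features_tensor = torch.tensor(
    np.vstack(embedding_values).astype(np.float32)
)
print(features_tensor.shape)  # torch.Size([9, 1536])
print(features_tensor.dtype)  # torch.float32
```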
And I try to cluster by doing:
clusters = util.community_detection(features_tensor, min_community_size=2, threshold=0.5)
The code runs indefinitely.
The machine I’m running this on is below average… an Intel i5-3337U with 8 GB of RAM, but I don’t know whether it’s a hardware and/or code issue.
OpenAI’s Ada showed better results than Sentence Transformers’ encodings.
My question is: Is the above code the best approach to using Fast Clustering? Should I reduce the dimensionality of ‘features_tensor’ with UMAP or PCA?
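If you do try the PCA route, note that with only 9 rows you can keep at most 9 components. A minimal NumPy-only sketch of PCA via SVD (random stand-in data; in practice you would likely reach for sklearn.decomposition.PCA or UMAP instead):

```python
import numpy as np

def pca_reduce(x: np.ndarray, n_components: int) -> np.ndarray:
    """Project the rows of x onto the top principal components via SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(9, 1536)).astype(np.float32)
reduced = pca_reduce(embeddings, n_components=8)
print(reduced.shape)  # (9, 8)
```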
Any help is really appreciated!
I’m just glancing at what you’ve presented, but isn’t the output for torch.Size out of order? Isn’t it stating that you have nine rows and one thousand five hundred and thirty-six columns? I’m not familiar with how OpenAI has implemented their embeddings, but regarding your dimensionality question, check out “Dimensionality Reduction for Data Visualization: PCA vs TSNE vs UMAP vs LDA: Visualising a high-dimensional dataset in 2D and 3D using PCA, TSNE, UMAP and LDA” by Siva Sivarajah. Also, how are you tracking the training of the model?
Thanks for your reply!
I copied and pasted the output of torch.Size. And yes, it has 9 rows and 1536 columns.
I’m just looking for a quick way to cluster the embeddings.
If I encode the data using the all-mpnet-base-v2 model, I can cluster it fine. However, if I switch to Ada’s embeddings, I hit the problem.
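For sanity-checking just 9 vectors, it can also help to bypass the library and run a toy version of threshold-based community detection directly; if this returns instantly on your Ada embeddings, the hang points at the library call or dtype rather than the data. A minimal stand-in (this is not sentence-transformers’ actual algorithm, just the same idea) shown on synthetic data:

```python
import numpy as np

def simple_communities(emb: np.ndarray, threshold: float, min_size: int):
    """Toy stand-in for util.community_detection: greedily group rows
    whose cosine similarity to a seed row meets `threshold`."""
    # Normalize rows so plain dot products are cosine similarities.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    assigned, communities = set(), []
    for i in range(len(emb)):
        if i in assigned:
            continue
        members = [j for j in range(len(emb))
                   if j not in assigned and sims[i, j] >= threshold]
        if len(members) >= min_size:
            communities.append(members)
            assigned.update(members)
    return communities

# Synthetic data: three orthogonal seed directions, each repeated
# three times with small noise, so three clear communities exist.
rng = np.random.default_rng(1)
base = np.eye(3, 16)
data = np.vstack([b + 0.01 * rng.normal(size=(3, 16)) for b in base])
print(simple_communities(data, threshold=0.9, min_size=2))
```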