I am looking for embeddings that would perform best for an unsupervised clustering task. The Sentence-Transformers library claims to have state-of-the-art embeddings for various relevant tasks, and its documentation offers a comparison of model performance, according to which 'all-mpnet-base-v2' performs best.
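For context, here is a minimal sketch of the kind of pipeline I have in mind. The model name, cluster count, and corpus are placeholders; random vectors stand in for real embeddings so the snippet runs without downloading a model:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# In practice the embeddings would come from sentence-transformers, e.g.:
#   from sentence_transformers import SentenceTransformer
#   embeddings = SentenceTransformer("all-mpnet-base-v2").encode(sentences)
# Here, random 768-dim vectors (the dimensionality of all-mpnet-base-v2)
# stand in for real sentence embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 768))

# Cluster the embeddings; k=5 is an arbitrary placeholder.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Silhouette score gives a label-free way to compare embedding models
# on the same corpus: higher means tighter, better-separated clusters.
score = silhouette_score(embeddings, labels)
print(labels.shape, round(score, 3))
```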
However, I am confused, because from what I read further, this model is based on MPNet and fine-tuned on a 1B-token dataset, while MPNet itself was trained on 160GB of data. And I couldn't find any GPT-based model mentioned in this comparison.
But aren't GPT embeddings considered state of the art? So why is there not even a discussion of how a fine-tuned GPT performs compared to the other models?
Does anybody have any insight into choosing the right model for my task, or into any of the other questions I raised?