I am looking for embeddings that would perform best for an unsupervised clustering task. The Sentence-Transformers library claims to have state-of-the-art embeddings for various relevant tasks, and its documentation offers a comparison of model performance, according to which 'all-mpnet-base-v2' performs best.
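For context, here is a minimal sketch of the kind of pipeline I have in mind. The model name, cluster count, and corpus are placeholders; random vectors stand in for real embeddings so the snippet runs without downloading a model:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# In practice the embeddings would come from sentence-transformers, e.g.:
#   from sentence_transformers import SentenceTransformer
#   embeddings = SentenceTransformer("all-mpnet-base-v2").encode(sentences)
# Here, random 768-dim vectors (the dimensionality of all-mpnet-base-v2)
# stand in for real sentence embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 768))

# Cluster the embeddings; k=5 is an arbitrary placeholder.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Silhouette score gives a label-free way to compare embedding models
# on the same corpus: higher means tighter, better-separated clusters.
score = silhouette_score(embeddings, labels)
print(labels.shape, round(score, 3))
```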
However, I am confused, because from what I read further, this model is based on MPNet and fine-tuned on a 1B-token dataset, while MPNet itself was trained on 160GB of data. And I couldn't find any GPT-based model mentioned in this comparison.
But aren't GPT embeddings considered state of the art? So why is there not even a discussion of how a fine-tuned GPT performs compared to the other models?
Does anybody have any insight into choosing the right model for my task, or into any of the other questions I raised?