Train the Best Sentence Embedding Model Ever with 1B Training Pairs

Comparison of Sentence-BERT and OpenAI’s Ada for Text Embedding

In this report, we compare two text embedding models: Sentence-BERT (SBERT) and OpenAI’s Ada model. Both transform text inputs into vector representations, widely known as embeddings. We evaluated them on a small multilingual example, using cosine similarity to measure how close the generated embeddings are.
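For reference, the cosine similarity between two embedding vectors u and v is the cosine of the angle between them; it ranges from -1 to 1, with higher values indicating more similar embeddings:

$$
\operatorname{cos\_sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$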

Experimental Setup

Our experiment was conducted using the following query sentence in Korean:

"다가오는 여행에 정말 기대돼요. 새로운 장소를 탐험하는 걸 기다릴 수가 없어요!"

We paired this query with the following target sentences in French, German, English, and Korean:

[
"Je suis en train de préparer un délicieux dîner pour mes invités. J'espère qu'ils vont adorer!",
"Die Präsentation war sehr informativ und hat mir neue Einblicke gegeben. Ich bin beeindruckt!",
"I'm eagerly anticipating the upcoming journey. Looking forward to discovering new destinations!",
"버트런드 러셀의 세가지 열정을 통해 그의 지성인으로써의 면모뿐 아니라 한 인간으로써 깊은 연민의 감정과 솔직함을 함께 볼 수 있다."
]

(For context: the French sentence is about preparing dinner for guests, the German one about an informative presentation, and the final Korean one about Bertrand Russell's three passions. Only the English sentence is a paraphrase of the query.)

These sentences were encoded using both the SBERT and OpenAI Ada models. The cosine similarity between the query sentence and each target sentence was then calculated.
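As a minimal sketch of this setup, the snippet below encodes the query and targets with both models and computes the cosine similarities. The exact checkpoints are not specified above, so "all-MiniLM-L6-v2" for SBERT and "text-embedding-ada-002" for OpenAI are assumptions, and the OpenAI client is assumed to read an OPENAI_API_KEY from the environment.

```python
# Minimal sketch of the comparison. The checkpoints are assumptions:
# "all-MiniLM-L6-v2" for SBERT and "text-embedding-ada-002" for Ada.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

query = "다가오는 여행에 정말 기대돼요. 새로운 장소를 탐험하는 걸 기다릴 수가 없어요!"
targets = [
    "Je suis en train de préparer un délicieux dîner pour mes invités. J'espère qu'ils vont adorer!",
    "Die Präsentation war sehr informativ und hat mir neue Einblicke gegeben. Ich bin beeindruckt!",
    "I'm eagerly anticipating the upcoming journey. Looking forward to discovering new destinations!",
    "버트런드 러셀의 세가지 열정을 통해 그의 지성인으로써의 면모뿐 아니라 한 인간으로써 깊은 연민의 감정과 솔직함을 함께 볼 수 있다.",
]

# SBERT: encode locally, then score query vs. each target with cosine similarity.
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
sbert_scores = util.cos_sim(sbert.encode(query), sbert.encode(targets))[0].tolist()

# OpenAI Ada: fetch embeddings from the API and score them the same way.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(model="text-embedding-ada-002", input=[query] + targets)
embeddings = [d.embedding for d in resp.data]
ada_scores = util.cos_sim(embeddings[0], embeddings[1:])[0].tolist()

print("SBERT:", sbert_scores)
print("Ada:  ", ada_scores)
```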

Results

The results of the cosine similarity calculations are as follows:

SBERT
Cosine Similarity Scores (French, German, English, Korean): [0.1356698, 0.076096766, 0.015867135, 0.58982027]

OpenAI Ada
Cosine Similarity Scores (French, German, English, Korean): [0.7172783545691249, 0.727737901177177, 0.8542776604362744, 0.7744492503622011]

The two models behave very differently on this cross-lingual example. Ada assigns its highest score (0.85) to the English sentence, which paraphrases the Korean query, and lower scores to the three unrelated sentences. SBERT, by contrast, gives that same English paraphrase its lowest score (0.02) and its highest score (0.59) to the topically unrelated Korean sentence, suggesting it groups sentences by language rather than by meaning. (Absolute cosine values are not directly comparable across models; the relative ranking is what matters.) On this example, OpenAI’s Ada model clearly handles cross-lingual semantic similarity better.

Implications and Next Steps

These results suggest that the OpenAI Ada model may be the more suitable choice for cross-lingual tasks, where semantic similarity has to be detected across different languages.

SBERT, while it did not perform as well in this experiment, may still be a good fit for monolingual tasks, especially where fine-tuning is required.

However, further evaluation is needed before concluding which model is more suitable. Such tests should cover more sentence pairs, more languages, and additional metrics beyond cosine similarity, and should weigh the limitations and strengths of each model.

In light of these results, we are interested in hearing the community’s thoughts and plans regarding contrastive training for cross-language samples. This kind of training, where the model is trained to bring semantically similar samples closer together in the embedding space and push dissimilar samples apart, could potentially improve performance on tasks involving multiple languages.
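As one possible starting point (a sketch, not a settled plan), sentence-transformers' MultipleNegativesRankingLoss implements exactly this contrastive objective when fed translation pairs: within each batch, the paired translation is the positive and every other sentence acts as an in-batch negative. The base checkpoint and the two pairs below are purely illustrative.

```python
# Sketch of contrastive training on cross-language pairs with
# MultipleNegativesRankingLoss; the checkpoint and pairs are illustrative only.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base checkpoint

# Each InputExample pairs a sentence with its translation. Within a batch, the
# paired translation is the positive; every other sentence is an in-batch negative.
train_examples = [
    InputExample(texts=["I can't wait to explore new places!",
                        "새로운 장소를 탐험하는 걸 기다릴 수가 없어요!"]),
    InputExample(texts=["The presentation was very informative.",
                        "Die Präsentation war sehr informativ."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Pulls translation pairs together and pushes unrelated sentences apart.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```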

Please feel free to share your thoughts, ideas, and any plans you might have related to this topic.