Train the Best Sentence Embedding Model Ever with 1B Training Pairs

Hi @mandubian
For this project we should build upon tested methods, as we only have the compute for a week.

But in general it would be really cool to know if MLPs also work for sentence embedding models.

One week… short… I understand… and I must say that it’s hard to justify such a study at work, and at home I don’t have the power for it… Anyway, if you can at least build these generic embeddings, it’s a first victory :slight_smile:

Thanks for the clarification :slight_smile:

Amazing to see so much activity here! Finalizing this project!

Hi @nreimers. How are you planning to prepare the training data? I know it’s comparatively simple compared to MLM tasks.
Is it preprocessing on the fly while training, or preprocessing before training and saving it as TFRecords or something?

The data might be too large to keep in RAM (not sure, will see).

So likely we will be reading data from disk and feeding it to the network. The data already comes in a pre-processed format, in the sense that the data file will contain one example pair per line.

So the idea is: read n lines from disk => feed them to the network,

with n being the batch size.
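
For illustration, here is a minimal sketch of that idea, reading one pre-processed pair file from disk and yielding n pairs at a time; the file name and the JSON-lines layout are assumptions, not the project’s actual format:

```python
# Minimal sketch (hypothetical file name and layout): stream a pre-processed
# pair file from disk and yield batches of n pairs without loading it into RAM.
import json
from typing import Iterator, List, Tuple

def read_pair_batches(path: str, batch_size: int) -> Iterator[List[Tuple[str, str]]]:
    """Yield lists of (sentence_a, sentence_b) pairs, batch_size at a time."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # each line holds one example pair, assumed here to be a JSON array ["text a", "text b"]
            pair = json.loads(line)
            batch.append((pair[0], pair[1]))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # last, possibly smaller batch
        yield batch

# usage: feed each batch to the tokenizer / network
# for batch in read_pair_batches("pairs.jsonl", batch_size=256):
#     ...
```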


Understood.
But Hugging Face datasets and TensorFlow datasets do lazy loading, right?
So, if we collate all data pairs as string pairs, can’t we avoid the pre-processing overhead (the only overhead here is tokenize(sentence)[:MAX_LENGTH])?
Anyway thanks for your reply. :slight_smile:
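
As a rough sketch of that lazy-loading idea with 🤗 datasets, something like the following could stream the string pairs from disk and tokenize them on the fly; the file name, column names, and tokenizer checkpoint are assumptions:

```python
# Sketch: stream pairs lazily and tokenize on the fly with 🤗 datasets.
# "pairs.jsonl", the column names, and the tokenizer are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

MAX_LENGTH = 128
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# streaming=True keeps the data on disk and reads examples lazily
pairs = load_dataset("json", data_files="pairs.jsonl", split="train", streaming=True)

def tokenize_pair(example):
    # the only preprocessing overhead: tokenize and truncate to MAX_LENGTH
    enc_a = tokenizer(example["text_a"], truncation=True, max_length=MAX_LENGTH)
    enc_b = tokenizer(example["text_b"], truncation=True, max_length=MAX_LENGTH)
    return {"input_ids_a": enc_a["input_ids"], "input_ids_b": enc_b["input_ids"]}

tokenized = pairs.map(tokenize_pair)
```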

@nreimers I’m particularly interested in the QA/Semantic Search aspect of this and would like to join that component if possible.

Hi @nreimers – I have added my name to the google sheet for QA/Semantic Search as well as Poly.AI Conversation AI row. Hope that works fine with the team. Looking forward to working together.

@wolosonovich Great. Have you already joined our Discord server? There is a spreadsheet where you can add your name to the teams.

@devv: Great, thanks for your help :slight_smile:

@nreimers when I attempt to join the Discord server it says that I’m “unable to accept” the invitation.

Hi @wolosonovich
You can try this link: flax-jax-community-week-sentence-embeddings

same result @nreimers


Not sure why this happens.

Here are others who have had similar issues:

https://support.discord.com/hc/en-us/community/posts/1500000435101-unable-to-accept-invite-discord

Have you maybe already joined 100 servers?

Excited to join this, excited to get started!

I’m George Sivulka and I live for neural IR.

I figured it out @nreimers. I tried opening the link with Firefox and it worked (I was using Chrome on Linux originally).

Comparison of Sentence-BERT and OpenAI’s Ada for Text Embedding

In this report, we compare two text embedding models: Sentence-BERT (SBERT) and OpenAI’s Ada model. These models transform text inputs into vector representations, widely known as embeddings. We examined their performance on a multilingual dataset using cosine similarity as the metric to assess the closeness of the generated embeddings.
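
For reference, the cosine similarity used here measures the angle between two embedding vectors $u$ and $v$:

$$\operatorname{cos\_sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \in [-1, 1],$$

with values closer to 1 indicating more similar embeddings.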

Experimental Setup

Our experiment was conducted using the following query sentence in Korean:

"다가오는 여행에 정말 기대돼요. 새로운 장소를 탐험하는 걸 기다릴 수가 없어요!"

We paired this query with the following target sentences in French, German, English, and Korean:

[
"Je suis en train de préparer un délicieux dîner pour mes invités. J'espère qu'ils vont adorer!",
"Die Präsentation war sehr informativ und hat mir neue Einblicke gegeben. Ich bin beeindruckt!",
"I'm eagerly anticipating the upcoming journey. Looking forward to discovering new destinations!",
"버트런드 러셀의 세가지 열정을 통해 그의 지성인으로써의 면모뿐 아니라 한 인간으로써 깊은 연민의 감정과 솔직함을 함께 볼 수 있다."
]

In English, the four targets are roughly: "I'm preparing a delicious dinner for my guests. I hope they'll love it!" (French), "The presentation was very informative and gave me new insights. I'm impressed!" (German), the English sentence above, which paraphrases the query, and "Through Bertrand Russell's three passions, one can see not only his qualities as an intellectual but also his deep compassion and honesty as a human being." (Korean).

These sentences were encoded using both the SBERT and OpenAI Ada models. The cosine similarity between the query sentence and each target sentence was then calculated.
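
The report does not state the exact SBERT checkpoint or OpenAI client version used; as a sketch of the pipeline under those assumptions (placeholder checkpoint, pre-1.0 openai client), the comparison could look like this:

```python
# Sketch of the comparison pipeline. The SBERT checkpoint and the pre-1.0 openai
# client style below are assumptions; the report does not name them.
import numpy as np
import openai
from sentence_transformers import SentenceTransformer, util

query = "다가오는 여행에 정말 기대돼요. 새로운 장소를 탐험하는 걸 기다릴 수가 없어요!"
targets = [
    "Je suis en train de préparer un délicieux dîner pour mes invités. J'espère qu'ils vont adorer!",
    "Die Präsentation war sehr informativ und hat mir neue Einblicke gegeben. Ich bin beeindruckt!",
    "I'm eagerly anticipating the upcoming journey. Looking forward to discovering new destinations!",
    "버트런드 러셀의 세가지 열정을 통해 그의 지성인으로써의 면모뿐 아니라 한 인간으로써 깊은 연민의 감정과 솔직함을 함께 볼 수 있다.",
]

# --- SBERT (placeholder checkpoint) ---
sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
q_emb = sbert.encode(query, convert_to_tensor=True)
t_emb = sbert.encode(targets, convert_to_tensor=True)
sbert_scores = util.cos_sim(q_emb, t_emb)[0].tolist()

# --- OpenAI Ada (pre-1.0 client; newer versions use openai.OpenAI().embeddings.create) ---
openai.api_key = "sk-..."  # placeholder

def ada_embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d["embedding"]) for d in resp["data"]]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ada_q = ada_embed([query])[0]
ada_scores = [cosine(ada_q, e) for e in ada_embed(targets)]

print("SBERT:", sbert_scores)
print("Ada:  ", ada_scores)
```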

Results

The results of the cosine similarity calculations are as follows:

SBERT
Cosine Similarity Scores: [0.1356698, 0.076096766, 0.015867135, 0.58982027]

OpenAI Ada
Cosine Similarity Scores: [0.7172783545691249, 0.727737901177177, 0.8542776604362744, 0.7744492503622011]

As can be seen from the above results, OpenAI’s Ada model outperformed SBERT in detecting similarity across languages. All of the cosine similarity scores from the OpenAI model were above 0.7, whereas the SBERT scores were much lower for the French, German, and English targets, with only the Korean target reaching 0.59.

Implications and Next Steps

These results suggest that the OpenAI Ada model might be a more suitable choice for tasks involving multiple languages or when semantic similarity is required across different languages.

SBERT, while not performing as well in this experiment, may still be suitable for tasks within a single language context, especially where fine-tuning capabilities are required.

However, it’s important to conduct further tests before concluding which model is most suitable. Such tests might include more metrics and larger datasets, and they should take into account the limitations and strengths of each model.

In light of these results, we are interested in hearing the community’s thoughts and plans regarding contrastive training for cross-language samples. This kind of training, where the model is trained to bring semantically similar samples closer together in the embedding space and push dissimilar samples apart, could potentially improve performance on tasks involving multiple languages.
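
For concreteness, here is a minimal sketch of that contrastive setup using sentence-transformers’ MultipleNegativesRankingLoss with in-batch negatives; the checkpoint and the toy cross-language pairs are placeholders, not a proposed recipe:

```python
# Minimal contrastive-training sketch with in-batch negatives
# (MultipleNegativesRankingLoss). Checkpoint and toy pairs are placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder

# each InputExample is a cross-language positive pair (same meaning, different language)
train_examples = [
    InputExample(texts=["I'm looking forward to the trip.",
                        "다가오는 여행에 정말 기대돼요."]),
    InputExample(texts=["The presentation was very informative.",
                        "Die Präsentation war sehr informativ."]),
]

train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
# pulls each pair together, pushes it away from the other pairs in the batch
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=0)
```

With this loss, each translated pair is pulled together in the embedding space while the other pairs in the batch serve as negatives, which is essentially the contrastive objective described above.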

Please feel free to share your thoughts, ideas, and any plans you might have related to this topic.