Train the Best Sentence Embedding Model Ever with 1B Training Pairs

Background

The quality of sentence embedding models can be increased easily via larger training datasets and larger batch sizes during training.

However, training on large datasets with large batch sizes requires a lot of GPU / TPU memory.

A TPU v3-8 offers a massive 128 GB of memory, enabling the training of amazing sentence embedding models.

Join me and use this event to train the best sentence embedding models that ever existed.

Roadmap

  1. Create a JAX training script for MultipleNegativesRankingLoss. MultipleNegativesRankingLoss is currently the best method to train sentence embeddings. As training data, we need text pairs (textA, textB) where textA and textB should be close in vector space. This can be anything like (question, answer), (text, summary), (paper, related_paper), (input, response). A minimal JAX sketch follows after this list.
  2. Collect suitable training data:
    • I already have 25 suitable training datasets that provide 100+ million training pairs (some are listed here).
    • Mine StackExchange (title, question, best_answer) triplets from the Stack Exchange archive
    • Mine conversational datasets from Reddit: PolyAI has the script ready
    • Extract Wikipedia intro sections for articles that are in the same category
    • Do you have further ideas for suitable (large scale) datasets?
  3. The data collection should give us a training dataset of 1+ billion pairs
  4. Train on this massive corpus and create the best sentence embedding model that ever existed
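For illustration, here is a minimal sketch of what MultipleNegativesRankingLoss could look like in JAX, assuming both sides of each pair have already been encoded into fixed-size embeddings (function and variable names are placeholders, not the final training script):

```python
import jax
import jax.numpy as jnp

def multiple_negatives_ranking_loss(emb_a, emb_b, scale=20.0):
    """In-batch negatives: for each textA_i, the paired textB_i is the positive
    and all other textB_j in the batch serve as negatives."""
    # L2-normalize so the dot product equals cosine similarity
    emb_a = emb_a / jnp.linalg.norm(emb_a, axis=-1, keepdims=True)
    emb_b = emb_b / jnp.linalg.norm(emb_b, axis=-1, keepdims=True)

    # (batch, batch) matrix of scaled cosine similarities
    scores = scale * emb_a @ emb_b.T

    # The correct "class" for row i is column i, i.e. the diagonal,
    # so cross-entropy reduces to the negative log-softmax of the diagonal
    log_probs = jax.nn.log_softmax(scores, axis=-1)
    return -jnp.mean(jnp.diag(log_probs))

# Toy usage with random embeddings (batch of 8 pairs, 768-dim vectors)
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
loss = multiple_negatives_ranking_loss(
    jax.random.normal(key_a, (8, 768)),
    jax.random.normal(key_b, (8, 768)),
)
```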

Language

The initial training will focus on English. After the event, Multi-Lingual Knowledge Distillation will be used to transfer the model to 50+ languages.

If you have large scale training data for other languages, feel free to provide it and we can try to train a multilingual model too.

Output

We will train different models:

  • General purpose model
  • Dedicated model for Semantic Search / Question-Answer-Retrieval
  • Dedicated model for Conversational AI

Models and datasets will be shared with the community.

Model to Use

We will use RoBERTa and maybe several smaller models (Distil* models, MiniLM, etc.).

Training Script

The training script is not yet available, but creating it is not too difficult: we need the token embeddings from the model + mean pooling + cosine similarity + CrossEntropyLoss.
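As a rough sketch of the pooling step (not the final script), mean pooling over the token embeddings could look like this in JAX; the pooled sentence vectors would then feed into the cosine-similarity score matrix and CrossEntropyLoss sketched above under the roadmap:

```python
import jax.numpy as jnp

def mean_pooling(token_embeddings, attention_mask):
    """Average the token embeddings while ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden) output of the transformer
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = jnp.sum(token_embeddings * mask, axis=1)
    counts = jnp.maximum(jnp.sum(mask, axis=1), 1e-9)  # avoid division by zero
    return summed / counts
```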

You want to join?

Help is needed on different aspects of the project:

  • Data crawling & preparation (more data is always better)
  • Creating a suitable JAX train script for InfoNCE Loss (I have one for PyTorch)
  • Create code for data loading so that we can train on 1B train pairs (a rough sketch follows below)
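As one possible starting point for the data loading (an untested sketch; the gzipped JSON-lines layout is only an assumption), the pairs could be streamed from disk and interleaved into batches so that the 1B pairs never have to fit in memory:

```python
import gzip
import json
import random

def stream_pairs(path):
    """Yield (textA, textB) pairs from a gzipped JSON-lines file, line by line."""
    with gzip.open(path, "rt", encoding="utf8") as f:
        for line in f:
            text_a, text_b = json.loads(line)  # assumed format: ["textA", "textB"]
            yield text_a, text_b

def batched_pairs(paths, batch_size=256, seed=42):
    """Round-robin over several pair datasets and yield shuffled batches."""
    rng = random.Random(seed)
    streams = [stream_pairs(p) for p in paths]
    batch = []
    while streams:
        for stream in list(streams):
            try:
                batch.append(next(stream))
            except StopIteration:
                streams.remove(stream)
                continue
            if len(batch) == batch_size:
                rng.shuffle(batch)
                yield batch
                batch = []
```

In practice we would probably also want to avoid near-duplicate pairs within a batch, since they would act as false negatives for the in-batch loss.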

Interested in joining? Please send me an email to
nils@huggingface.co

so that I can invite you to the kick-off event.

We also have our own Discord server for communication:

Data crawling, preparation, and code writing must happen before we get the TPU compute power (7.07. - 14.07.).

19 Likes

Very cool idea! I have the impression that Transformer sentence embeddings have been a bit neglected since Sentence-BERT, so I’m interested in joining this project!
Useful info about me:

  • I’m a young AI researcher working at Indigo.ai, a small company based in Italy and focused on conversational AI
  • I have experience with the :hugs: Transformers library, but I have never used JAX. I think this project could be an occasion for me to gain confidence with this library, so I would like to work on the training script for this project, but obviously I’m open to other tasks! :slight_smile:
  • If you want to know my time zone…I live in Italy! :it:

Just one question: what’s the idea behind training this model? It seems that you only need pairs of similar sentences, right? No negative examples (sentences with different meanings) are required?

1 Like

Hi @mmuffo94 Great to hear that. Yes, for training you just need positives. Negatives are sampled automatically from the other examples in the batch (in-batch negatives). This works extremely well, and the larger the batch size, the better your results.
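To make the in-batch negatives concrete, a toy illustration (not project code): with n pairs in a batch, every example is scored against n - 1 automatically sampled negatives, which is why larger batches help:

```python
import jax.numpy as jnp

# Cosine similarities for a batch of 4 pairs: rows = textA_i, columns = textB_j
scores = jnp.array([[0.9, 0.1, 0.2, 0.0],
                    [0.2, 0.8, 0.1, 0.3],
                    [0.0, 0.2, 0.7, 0.1],
                    [0.1, 0.0, 0.3, 0.9]])

# The positive for row i is column i (the diagonal); the other columns are
# the in-batch negatives. Batch size 4 gives 3 negatives per example,
# batch size 1024 would give 1023.
labels = jnp.arange(scores.shape[0])  # [0, 1, 2, 3]

# With these toy scores every positive is already ranked highest:
accuracy = jnp.mean(scores.argmax(axis=1) == labels)  # -> 1.0
```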

Are you already on Slack? I created a channel here for communication:
https://join.slack.com/share/zt-s7z62gb7-e1yhRn2l9aWSlabuGhe~xQ

Sounds fun. I would like to join the team.

2 Likes

Hey @nreimers,

I am super interested in this & would like to join the team.

About me: I contributed BigBird (in Flax & PyTorch) to Transformers. I also fine-tuned the Flax version on the Natural Questions dataset (~100 GB in size) on a TPU v3-8, and made a PR adding the training script to the Transformers examples.

Time Zone: India

2 Likes

It sounds interesting and reminds me of the DPR training process for the ELI5 project by Yacine, so I would also like to join.

1 Like

Hi @nreimers,
I have worked a bit on sentence embeddings and would love to join community week working on this topic.
Cheers,
Dennis

1 Like

Hi @nreimers: I would love to be part of the group as well. I’m based out of India and currently work as a Senior Applied Data Scientist, with experience working on end-to-end NLP projects. I have used Transformers extensively but am new to JAX, so I would take this opportunity to learn new skills and contribute to open source (I have just started that journey).

1 Like

Hi @nreimers, I would like to join as well!

1 Like

Hey @nreimers, this is a great idea! I work in industry, and sentence embeddings are enormously useful for the applications I work on, so I’m very interested in contributing.

In particular, I’d love to help with the data collection process. Finding different types of pairs/triplets that can increase performance in novel applications is very intriguing, and I can see a lot of value in expanding to more heterogeneous pairs.

Happy about all the positive feedback.

It would be great if you could send me an email to nils@huggingface.co

I plan a kick-off event next week to share some educational material and to talk about the project. The kick-off will be open to everyone who wants to learn how to train dense embedding models.

I also created a Discord server which we can use for communication:

3 Likes

Great to have so many people here. :clap:

I want to have a kick-off event next week. To find a good time slot, it would be great if you could fill in your available slots here:
https://www.when2meet.com/?12195163-8QN3P

The kick-off event will be recorded (and hosted on YouTube), so that everyone who misses it can watch it later.

Content of the event:

  • Theory on how to train good sentence embedding models
  • Possible datasets we have and which we should collect (I’m also looking forward to hearing which datasets you might have)
  • English vs. multilingual model
  • Organization: Who can help with datasets? Who can help with JAX programming? Who can help program the data loading setup? Who is interested in evaluating the model?

Don’t forget to join our discord server to get the latest news:

Sadly, I don’t have the time to participate, but I just wanted to throw in a potential large dataset y’all could use. The CodeSearchNet challenge dataset covers multiple programming languages and provides (code, documentation) pairs that might be a cool addition to the ones already listed :nerd_face:. The dataset is also available via HF Datasets: code_search_net · Datasets at Hugging Face

1 Like

Hey @nreimers,
This is a great initiative!

Did you consider using a different model than RoBERTa? I can imagine that a newer model like ALBERT could lead to better performance. Two main reasons:

  1. ALBERT is pretrained both with MLM and with a sentence segment coherence objective, which was specifically designed for downstream tasks with multi-sentence inputs and leads to better performance than RoBERTa on these tasks. I can imagine that sentence embeddings benefit from this too. RoBERTa was only pretrained with MLM.
  2. ALBERT was specifically designed for setups where memory limitations play an important role. This seems to be exactly your use case, if I understand correctly.

See details here: [1909.11942] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

1 Like

It sounds fine. I would be interested to join!

1 Like

Hi @MoritzLaurer
Thanks for the suggestion.

Quality on supervised benchmarks like GLUE / SuperGLUE sadly does not correlate with performance for dense embedding models. Many of the newer models that perform better on GLUE / SuperGLUE sadly fail to produce good vector spaces.

You can find some benchmarks here:
https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models

The ‘paraphrase-albert-base-v2’ model performed on par with ‘paraphrase-albert-small-v2’, and both performed worse than e.g. DistilRoBERTa.

So far I have had the best experience with MPNet, which is sadly not yet available in JAX.

Second best results were achieved by BERT (and variations like TinyBERT, DistilBERT, MiniLM) and RoBERTa.

But the ALBERT model is interesting due to its small size. So if we have enough compute power left in that week, we could also try to tune ALBERT.

3 Likes

Any reason for not preferring a contrastive learning framework like the one in https://arxiv.org/pdf/2104.08821.pdf?

I’d love to join!

Hey @nreimers,
Cool initiative! Sentence embeddings have been one of my main NLP tools over the last year, and SBERT allowed me to bootstrap a few working PoCs… Yet it was quite perfectible, especially in custom contexts and specific domains. So it’s a great idea to push this massive study for generic embeddings. But I think it would also be cool to study how such embeddings can be contextualized, or how to make it easier to contextualize them or to embed concepts that would allow contextualization, etc.

Finally, having studied & played with the paper “Pay Attention to MLPs” (https://arxiv.org/pdf/2105.08050.pdf), which studies how MLPs could be an alternative to self-attention in seq2seq models, I would be curious to see how it behaves for sentence embeddings :wink:

Anyway, If I can help, I’d be happy to contribute.

Hi @paws
The loss function in the linked SimCSE paper is just the MultipleNegativesRankingLoss, a loss function that has long been known and has been used many times to train sentence embeddings. Nothing novel; that loss function was already implemented in version 0.0.1 of sentence-transformers in early 2019.

But this loss function is quite old. I think this paper from 2007 proposed the loss for the first time (known as ListNet): Learning to Rank: From Pairwise Approach to Listwise Approach - Microsoft Research

This paper from 2017 used that loss function and combined it with in-batch negatives (section 4.4): [1705.00652] Efficient Natural Language Response Suggestion for Smart Reply

Many subsequent papers have used it successfully to train embedding models (e.g. [1803.11175] Universal Sentence Encoder, [1810.12836] Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model, Training Neural Response Selection for Task-Oriented Dialogue Systems - ACL Anthology, [1911.03688] ConveRT: Efficient and Accurate Conversational Representations from Transformers, [1902.08564] Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax, [2007.01852] Language-agnostic BERT Sentence Embedding).

So yes, this approach is one of the most common (and successful) ways to train sentence-embeddings. Hence, it will also be used for this project.

1 Like