Train the Best Sentence Embedding Model Ever with 1B Training Pairs

Background

The quality of sentence embedding models can be increased easily via larger training datasets and larger batch sizes during training.

However, training on large datasets with large batch sizes requires a lot of GPU / TPU memory.

A TPU v3-8 offers a massive 128 GB of memory, enabling the training of amazing sentence embedding models.

Join me and use this event to train the best sentence embedding models that ever existed.

Roadmap

  1. Create a JAX training script for MultipleNegativesRankingLoss. MultipleNegativesRankingLoss is currently the best method to train sentence embeddings. As training data, we need text pairs (textA, textB) where textA and textB should be close in vector space. This can be anything like (question, answer), (text, summary), (paper, related_paper), (input, response). A minimal JAX sketch follows after this list.
  2. Collect suitable training data:
    • I already have 25 suitable training datasets that provide 100+ million training pairs (some are listed here).
    • Mine StackExchange (title, question, best_answer) triplets from the Stack Exchange archive
    • Mine conversational datasets from Reddit: PolyAI has the script ready
    • Extract Wikipedia intro sections for articles that are in the same category
    • Do you have further ideas for suitable (large scale) datasets?
  3. The data collection should give us a training dataset of 1+ billion pairs
  4. Train on this massive corpus and create the best sentence embedding model that ever existed
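For illustration, here is a minimal sketch of what MultipleNegativesRankingLoss could look like in JAX, assuming both sides of each pair have already been encoded into fixed-size embeddings (function and variable names are placeholders, not the final training script):

```python
import jax
import jax.numpy as jnp

def multiple_negatives_ranking_loss(emb_a, emb_b, scale=20.0):
    """In-batch negatives: for each textA_i, the paired textB_i is the positive
    and all other textB_j in the batch serve as negatives."""
    # L2-normalize so the dot product equals cosine similarity
    emb_a = emb_a / jnp.linalg.norm(emb_a, axis=-1, keepdims=True)
    emb_b = emb_b / jnp.linalg.norm(emb_b, axis=-1, keepdims=True)

    # (batch, batch) matrix of scaled cosine similarities
    scores = scale * emb_a @ emb_b.T

    # The correct "class" for row i is column i, i.e. the diagonal,
    # so cross-entropy reduces to the negative log-softmax of the diagonal
    log_probs = jax.nn.log_softmax(scores, axis=-1)
    return -jnp.mean(jnp.diag(log_probs))

# Toy usage with random embeddings (batch of 8 pairs, 768-dim vectors)
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
loss = multiple_negatives_ranking_loss(
    jax.random.normal(key_a, (8, 768)),
    jax.random.normal(key_b, (8, 768)),
)
```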

Language

The initial training will focus on English. After the event, Multi-Lingual Knowledge Distillation will be used to transfer the model to 50+ languages.

If you have large scale training data for other languages, feel free to provide it and we can try to train a multilingual model too.

Output

We will train different models:

  • General purpose model
  • Dedicated model for Semantic Search / Question-Answer-Retrieval
  • Dedicated model for Conversational AI

Models and datasets will be shared with the community.

Model to Use

We will use RoBERTa and maybe several smaller models (Distil* models, MiniLM, etc.).

Training Script

The training script is not yet available, but creating it is not too difficult: we need the token embeddings from the model + mean pooling + cosine similarity + CrossEntropyLoss.
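As a rough sketch of the pooling step (not the final script), mean pooling over the token embeddings could look like this in JAX; the pooled sentence vectors would then feed into the cosine-similarity score matrix and CrossEntropyLoss sketched above under the roadmap:

```python
import jax.numpy as jnp

def mean_pooling(token_embeddings, attention_mask):
    """Average the token embeddings while ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden) output of the transformer
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = jnp.sum(token_embeddings * mask, axis=1)
    counts = jnp.maximum(jnp.sum(mask, axis=1), 1e-9)  # avoid division by zero
    return summed / counts
```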

You want to join?

Help is needed on different aspects of the project:

  • Data crawling & preparation (more data is always better)
  • Creating a suitable JAX train script for InfoNCE Loss (I have one for PyTorch)
  • Create code for data loading so that we can train on 1B train pairs (a rough sketch follows below)
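As one possible starting point for the data loading (an untested sketch; the gzipped JSON-lines layout is only an assumption), the pairs could be streamed from disk and interleaved into batches so that the 1B pairs never have to fit in memory:

```python
import gzip
import json
import random

def stream_pairs(path):
    """Yield (textA, textB) pairs from a gzipped JSON-lines file, line by line."""
    with gzip.open(path, "rt", encoding="utf8") as f:
        for line in f:
            text_a, text_b = json.loads(line)  # assumed format: ["textA", "textB"]
            yield text_a, text_b

def batched_pairs(paths, batch_size=256, seed=42):
    """Round-robin over several pair datasets and yield shuffled batches."""
    rng = random.Random(seed)
    streams = [stream_pairs(p) for p in paths]
    batch = []
    while streams:
        for stream in list(streams):
            try:
                batch.append(next(stream))
            except StopIteration:
                streams.remove(stream)
                continue
            if len(batch) == batch_size:
                rng.shuffle(batch)
                yield batch
                batch = []
```

In practice we would probably also want to avoid near-duplicate pairs within a batch, since they would act as false negatives for the in-batch loss.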

Interested in joining? Please send me an email to
nils@huggingface.co

so that I can invite you to the kick-off event.

We also have our own Discord server for communication:

Data crawling, preparation, and code writing must happen before we get the TPU compute power (7.07. - 14.07.).

19 Likes

Very cool idea! I have the impression that Transformer sentence embeddings have been a bit neglected since Sentence-BERT, so I’m interested in joining this project!
Useful info about me:

  • I’m a young AI researcher working at Indigo.ai, a small company based in Italy and focused on conversational AI
  • I have experience with the :hugs: Transformers library, but I have never used JAX. I think this project could be an occasion for me to gain confidence with this library, so I would like to work on the training script for this project, but obviously I’m open to other tasks! :slight_smile:
  • If you want to know my time zone…I live in Italy! :it:

Just one question: what’s the idea behind training this model? It seems that you only need pairs of similar sentences, right? No negative examples (sentences with different meanings) are required?

1 Like

Hi @mmuffo94 Great to hear that. Yes, for training you just need positives. Negatives are sampled automatically from the other examples in the batch (in-batch negatives). This works extremely well, and the larger the batch size, the better your results.
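To make the in-batch negatives concrete, a toy illustration (not project code): with n pairs in a batch, every example is scored against n - 1 automatically sampled negatives, which is why larger batches help:

```python
import jax.numpy as jnp

# Cosine similarities for a batch of 4 pairs: rows = textA_i, columns = textB_j
scores = jnp.array([[0.9, 0.1, 0.2, 0.0],
                    [0.2, 0.8, 0.1, 0.3],
                    [0.0, 0.2, 0.7, 0.1],
                    [0.1, 0.0, 0.3, 0.9]])

# The positive for row i is column i (the diagonal); the other columns are
# the in-batch negatives. Batch size 4 gives 3 negatives per example,
# batch size 1024 would give 1023.
labels = jnp.arange(scores.shape[0])  # [0, 1, 2, 3]

# With these toy scores every positive is already ranked highest:
accuracy = jnp.mean(scores.argmax(axis=1) == labels)  # -> 1.0
```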

Are you already on Slack? I created a channel here for communication:
https://join.slack.com/share/zt-s7z62gb7-e1yhRn2l9aWSlabuGhe~xQ

Sounds fun. I would like to join the team.

2 Likes

Hey @nreimers,

I am super interested in this & would like to join the team.

About me: I contributed BigBird (in Flax & PyTorch) to Transformers. I also fine-tuned the Flax version on the Natural Questions dataset (~100 GB in size) on a TPU v3-8, and made a PR adding the training script to the Transformers examples.

Time Zone: India

2 Likes

It sounds interesting and reminds me of the DPR training process for the ELI5 project by Yacine, so I would also like to join.

1 Like

Hi @nreimers,
I have worked a bit on sentence embeddings and would love to join community week working on this topic.
Cheers,
Dennis

1 Like

Hi @nreimers: I would love to be part of the group as well. I’m based out of India and currently work as a Senior Applied Data Scientist, with experience working on end-to-end NLP projects. I have used Transformers extensively but am new to JAX, so I would take this opportunity to learn new skills and contribute to open source (I have just started that journey).

1 Like

Hi @nreimers, I would like to join as well!

1 Like

Hey @nreimers, this is a great idea! I work in industry, and sentence embeddings are enormously useful for the applications I work on, so I’m very interested in contributing.

In particular, I’d love to help with the data collection process. Finding different types of pairs/triplets that can increase performance in novel applications is very intriguing, and I can see a lot of value in expanding to more heterogeneous pairs.

Happy about all the positive feedback.

It would be great if you could send me an email to nils@huggingface.co

I plan a kick-off event next week to share some educational material and to talk about the project. The kick-off will be open to everyone who wants to learn how to train dense embedding models.

I also created a Discord server which we can use for communication:

3 Likes

Great to have so many people here. :clap:

I want to have a kick-off event next week. To find a good time slot, it would be great if you could fill in your available slots here:
https://www.when2meet.com/?12195163-8QN3P

The kick-off event will be recorded (and hosted on YouTube), so that everyone who misses it can watch it later.

Content of the event:

  • Theory on how to train good sentence embedding models
  • Possible datasets we have and which we should collect (I’m also looking forward to hearing which datasets you might have)
  • English vs. multilingual model
  • Organization: Who can help with datasets? Who can help with JAX programming? Who can help program the data loading setup? Who is interested in evaluating the model?

Don’t forget to join our discord server to get the latest news:

Sadly, I don’t have the time to participate, but I just wanted to throw in a potential large dataset y’all could use. The CodeSearchNet challenge dataset covers multiple programming languages and provides (code, documentation) pairs that might be a cool addition to the ones already listed :nerd_face:. The dataset is also available via HF Datasets: code_search_net · Datasets at Hugging Face

1 Like

Hey @nreimers,
This is a great initiative!

Did you consider using a different model than RoBERTa? I can imagine that a newer model like ALBERT could lead to better performance. Two main reasons:

  1. ALBERT is pretrained both with MLM and with a sentence segment coherence objective, which was specifically designed for downstream tasks with multi-sentence inputs and leads to better performance than RoBERTa on these tasks. I can imagine that sentence embeddings benefit from this too. RoBERTa was only pretrained with MLM.
  2. ALBERT was specifically designed for setups where memory limitations play an important role. This seems to be exactly your use case, if I understand correctly.

See details here: [1909.11942] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

1 Like

It sounds fine. I would be interested to join!

1 Like

Hi @MoritzLaurer
Thanks for the suggestion.

Quality on supervised benchmarks like GLUE / SuperGLUE sadly does not correlate with performance for dense embedding models. Many of the newer models that perform better on GLUE / SuperGLUE sadly fail to produce good vector spaces.

You can find some benchmarks here:
https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models

The ‘paraphrase-albert-base-v2’ model performed on par with ‘paraphrase-albert-small-v2’, and both performed worse than e.g. DistilRoBERTa.

So far I have had the best experience with MPNet, which is sadly not yet available in JAX.

Second best results were achieved by BERT (and variations like TinyBERT, DistilBERT, MiniLM) and RoBERTa.

But the ALBERT model is interesting due to its small size. So if we have enough compute power left in that week, we could also try to tune ALBERT.

3 Likes

Any reason for not preferring a contrastive learning framework like the one in https://arxiv.org/pdf/2104.08821.pdf?

I’d love to join!

Hey @nreimers,
Cool initiative! Sentence embeddings have been one of my main NLP tools over the last year, and SBERT allowed me to bootstrap a few working PoCs… Yet it was quite perfectible, especially in custom contexts and specific domains. So it’s a great idea to push this massive study for generic embeddings. But I think it would also be cool to study how such embeddings can be contextualized, or how to make it easier to contextualize them or to embed concepts that would allow contextualization, etc.

Finally, having studied & played with the paper “Pay Attention to MLPs” (https://arxiv.org/pdf/2105.08050.pdf), which studies how MLPs could be an alternative to self-attention in seq2seq models, I would be curious to see how it behaves for sentence embeddings :wink:

Anyway, If I can help, I’d be happy to contribute.

Hi @paws
The loss function in the linked SimCSE paper is just the MultipleNegativesRankingLoss, a loss function that has long been known and has been used many times to train sentence embeddings. Nothing novel; that loss function was already implemented in version 0.0.1 of sentence-transformers in early 2019.

But this loss function is quite old. I think this paper from 2007 proposed the loss for the first time (known as ListNet): Learning to Rank: From Pairwise Approach to Listwise Approach - Microsoft Research

This paper from 2017 used that loss function and combined it with in-batch negatives (section 4.4): [1705.00652] Efficient Natural Language Response Suggestion for Smart Reply

Many subsequent papers have used it successfully to train embedding models (e.g. [1803.11175] Universal Sentence Encoder, [1810.12836] Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model, Training Neural Response Selection for Task-Oriented Dialogue Systems - ACL Anthology, [1911.03688] ConveRT: Efficient and Accurate Conversational Representations from Transformers, [1902.08564] Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax, [2007.01852] Language-agnostic BERT Sentence Embedding).

So yes, this approach is one of the most common (and successful) ways to train sentence-embeddings. Hence, it will also be used for this project.

1 Like