Train the Best Sentence Embedding Model Ever with 1B Training Pairs

Background

The quality of sentence embedding models can be increased easily via:

  • Training on larger datasets
  • Training with larger batch sizes (more in-batch negatives)

However, training on large datasets with large batch sizes requires a lot of GPU / TPU memory.

A TPU v3-8 offers a massive 128 GB of memory, enabling the training of amazing sentence embedding models.

Join me and use this event to train the best sentence embedding models that ever existed.

Roadmap

  1. Create a JAX training script for MultipleNegativesRankingLoss. MultipleNegativesRankingLoss is currently the best method to train sentence embeddings. As training data, we need text pairs (textA, textB) where we want textA and textB to be close in vector space. This can be anything like (question, answer), (text, summary), (paper, related_paper), (input, response); see the illustrative pairs after this list.
  2. Collect suitable training data:
    • I already have 25 suitable training datasets that provide 100+ million training pairs (some are listed here).
    • Mine StackExchange (title, question, best_answer) triplets from the Stack Exchange archive
    • Mine Conversational Datasets from Reddit: PolyAI has the script ready
    • Extract Wikipedia intro sections for articles that are in the same category
    • Do you have further ideas for suitable (large scale) datasets?
  3. The data collection should give us a train dataset of 1+ billion pairs
  4. Train on this massive corpus and create the best sentence embedding model that ever existed
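
To make the pair format from step 1 concrete, here is a small illustrative sample (the texts below are made up; any of the dataset types above fits this simple two-text schema):

```python
# Illustrative training pairs (textA, textB). The only requirement is that the
# two texts should end up close to each other in vector space.
train_pairs = [
    ("How do I sort a list in Python?", "Use the built-in sorted() function or list.sort()."),
    ("What causes a rainbow?", "Refraction and reflection of sunlight in water droplets."),
    ("Title: Why is the sky blue?", "Best answer: Rayleigh scattering of sunlight in the atmosphere."),
]
```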

Language

The initial training will focus on English. After the event, Multi-Lingual Knowledge Distillation will be used to transfer the model to 50+ languages.
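
For reference, a minimal sketch of the multilingual knowledge distillation objective (names are illustrative): the student is trained so that both the English sentence and its translation map onto the frozen English teacher's embedding.

```python
import jax.numpy as jnp

def distillation_loss(teacher_emb_en, student_emb_en, student_emb_trans):
    # Mean squared error pulls the student's embedding of the English sentence
    # and of its translation towards the (frozen) English teacher embedding.
    mse = lambda a, b: jnp.mean((a - b) ** 2)
    return mse(teacher_emb_en, student_emb_en) + mse(teacher_emb_en, student_emb_trans)
```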

If you have large scale training data for other languages, feel free to provide it and we can try to train a multilingual model too.

Output

We will train different models:

  • General purpose model
  • Dedicated model for Semantic Search / Question-Answer-Retrieval
  • Dedicated model for Conversational AI

Models and datasets will be shared with the community.

Model to Use

We will use RoBERTa and maybe several smaller models (Distil*-Models, MiniLM etc.).
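
As a starting point, the Flax weights can be loaded directly via transformers. A minimal sketch, assuming Flax weights are available on the Hub for the chosen checkpoint (roberta-base here is just an example):

```python
from transformers import AutoTokenizer, FlaxRobertaModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = FlaxRobertaModel.from_pretrained("roberta-base")

batch = tokenizer(
    ["A sample sentence.", "Another sentence."],
    padding=True, truncation=True, return_tensors="np",
)
outputs = model(**batch)
token_embeddings = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_dim)
```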

Training Script

The training script is not yet available, but creating it is not too difficult: we need the token embeddings from the model + mean pooling + cosine similarity + CrossEntropyLoss.
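
A minimal JAX sketch of these pieces, assuming the token embeddings and attention masks come from a model as above (shapes and names are illustrative, not a finished training script):

```python
import jax
import jax.numpy as jnp


def mean_pooling(token_embeddings, attention_mask):
    # Average the token embeddings over the sequence, ignoring padding positions.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (B, L, 1)
    summed = jnp.sum(token_embeddings * mask, axis=1)                # (B, D)
    counts = jnp.maximum(jnp.sum(mask, axis=1), 1e-9)                # (B, 1)
    return summed / counts


def multiple_negatives_ranking_loss(emb_a, emb_b, scale=20.0):
    # Normalize so that the dot product equals the cosine similarity.
    emb_a = emb_a / jnp.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / jnp.linalg.norm(emb_b, axis=1, keepdims=True)
    scores = scale * emb_a @ emb_b.T           # (B, B): each textA vs. all textB in the batch
    # The matching pair of row i sits at column i; all other columns are in-batch negatives.
    log_probs = jax.nn.log_softmax(scores, axis=1)
    return -jnp.mean(jnp.diagonal(log_probs))  # cross-entropy with diagonal labels
```

The scale factor (20 is the default in Sentence-Transformers' MultipleNegativesRankingLoss) sharpens the softmax over the cosine similarities, which otherwise only range from -1 to 1.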

You want to join?

Help is needed on different aspects of the project:

  • Data crawling & preparation (more data is always better)
  • Creating a suitable JAX train script for InfoNCE Loss (I have one for PyTorch)
  • Creating code for data loading so that we can train on 1B training pairs (see the sketch after this list)
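
One possible approach for the data loading is to stream and interleave the individual pair datasets instead of loading 1B pairs into memory, e.g. with 🤗 datasets. A rough sketch; the file names and sampling probabilities below are placeholders:

```python
from datasets import load_dataset, interleave_datasets

# Stream each pair dataset from disk / the Hub instead of loading 1B pairs into RAM.
ds_a = load_dataset("json", data_files="stackexchange_pairs.jsonl", split="train", streaming=True)
ds_b = load_dataset("json", data_files="reddit_pairs.jsonl", split="train", streaming=True)

# Mix the sources with chosen probabilities and shuffle with a small in-memory buffer.
train_stream = interleave_datasets([ds_a, ds_b], probabilities=[0.5, 0.5], seed=42)
train_stream = train_stream.shuffle(buffer_size=10_000, seed=42)

for example in train_stream.take(3):
    print(example)  # e.g. {"textA": ..., "textB": ...}
```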

Interested in joining? Please send me an email to
nils@huggingface.co

so that I can invite you to the kick-off event.

We also have our own Discord server for communication:

Data crawling, preparation, and code writing must happen before we get the TPU compute power (7.07. - 14.07.).

21 Likes

Very cool idea! I have the impression that Transformer sentence embeddings have been a bit neglected since Sentence-BERT, so I’m interested in joining this project!
Useful info about me:

  • I’m a young AI researcher working at Indigo.ai, a small company based in Italy and focused on conversational AI
  • I have experience with the :hugs: Transformers library, but I have never used JAX. I think this project could be an occasion for me to gain confidence with the library, so I would like to work on the training script, but obviously I’m open to other tasks! :slight_smile:
  • If you want to know my time zone…I live in Italy! :it:

Just one question: what’s the idea for training this model? It seems that you only need pairs of similar sentences, right? No negative examples (sentences with different meanings) are required?

1 Like

Hi @mmuffo94, great to hear that. Yes, for training you just need positives. Negatives are sampled automatically from the other examples in a batch (in-batch negatives). This works extremely well, and the larger the batch size, the better your results.
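
A tiny illustration of the in-batch negatives (hypothetical shapes, just to show where the labels come from):

```python
import jax.numpy as jnp

batch_size = 4
# Similarity of every textA in the batch against every textB in the batch.
scores = jnp.zeros((batch_size, batch_size))  # placeholder values, shape (B, B)
labels = jnp.arange(batch_size)               # the matching textB for row i is column i
# Each example thus gets 1 positive and batch_size - 1 in-batch negatives,
# which is why larger batches tend to improve results.
```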

Are you already on Slack? I created a channel here for communication:

Sounds fun. I would like to join the team.

2 Likes

It’s really interesting. I would like to join this project. I hope after this event we can create a great model for Conversational AI in 50+ languages.

1 Like

Hey @nreimers,

I am super interested in this & would like to join the team.

About me: I contributed BigBird (in Flax & PyTorch) to Transformers. I also fine-tuned the Flax version on the Natural Questions dataset (~100 GB in size) on a TPU v3-8, and made a PR to add the training script to the Transformers examples.

Time Zone: India

2 Likes

It sounds interesting and reminds me of the DPR training process for the ELI5 project by Yacine. So I would also like to join.

1 Like

Hi @nreimers,
I have worked a bit on sentence embeddings and would love to join community week working on this topic.
Cheers,
Dennis

1 Like

Hi @nreimers: I would love to be part of the group as well. I’m based out of India and currently work as a Senior Applied Data Scientist, with experience working on end-to-end NLP projects. I have used transformers extensively but am new to JAX, so I would take this opportunity to learn new skills and contribute to open source (I have just started that journey).

1 Like

Hi @nreimers, I would like to join as well!

1 Like

Hey @nreimers, this is a great idea! I work in industry, and sentence embeddings are enormously useful for the applications I work on, so I’m very interested in contributing.

In particular, I’d love to help with the data collection process. Finding different types of pairs/triplets that can increase performance in novel applications is very intriguing, and I can see a lot of value in expanding to more heterogeneous pairs.

Happy about all the positive feedback.

It would be great if you could send me an email to nils@huggingface.co

I plan a kick-off event next week to share some educational material and to talk about the project. The kick-off will be open to everyone who wants to learn how to train dense embedding models.

I also created a Discord server which we can use for communication:

3 Likes

Hello Nils, I’d like to join this project. Will leave a note in the email!

Great to have so many people here. :clap:

I want to have a kick-off event next week. To find a good time slot, it would be great if you could fill in your available slots here:
https://www.when2meet.com/?12195163-8QN3P

The kick-off event will be recorded (and hosted on YouTube), so that everyone who misses it can watch it later.

Content of the event:

  • Theory on how to train good sentence embedding models
  • Possible datasets we have and which we should collect (I’m also looking forward to hearing which datasets you might have)
  • English vs. multilingual model
  • Organization: Who can help with datasets? Who can help with programming in JAX? Who can help with programming the data loading setup? Who is interested in evaluating the model?

Don’t forget to join our Discord server to get the latest news:

Sadly, I don’t have the time to participate, but I just wanted to throw in a potential large dataset y’all could use. The CodeSearchNet challenge dataset covers multiple programming languages and provides (code, javadoc) pairs that might be a cool addition to the ones already listed :nerd_face:. The dataset is also available via HF datasets: code_search_net · Datasets at Hugging Face

1 Like

Hey @nreimers,
This is a great initiative!

Did you consider using a different model than RoBERTa? I can imagine that a newer model like ALBERT could lead to better performance. Two main reasons:

  1. ALBERT is pretrained with both MLM and a sentence segment coherence objective, which was specifically designed for downstream tasks with multi-sentence inputs and leads to better performance than RoBERTa on these tasks. I can imagine that sentence embeddings benefit from this too. RoBERTa was only pretrained with MLM.
  2. ALBERT was specifically designed for setups where memory limitations play an important role. This seems to be exactly your use case, if I understand correctly.

See details here: [1909.11942] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

1 Like

It sounds fine. I would be interested to join!

1 Like

Hi @MoritzLaurer
Thanks for the suggestion.

Quality on supervised benchmarks like GLUE / SuperGLUE sadly does not correlate with performance for dense embedding models. Many of the newer models that perform better on GLUE / SuperGLUE fail to produce good vector spaces.

You can find some benchmarks here:
https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models

The ‘paraphrase-albert-base-v2’ model performed on par with ‘paraphrase-albert-small-v2’, and both performed worse than e.g. DistilRoBERTa.

So far I have had the best results with MPNet, which is sadly not yet available in JAX.

Second best results were achieved by BERT (and variations like TinyBERT, DistilBERT, MiniLM) and RoBERTa.

But the ALBERT model is interesting due to its small size. So if we have enough compute power left in that week, we could also try tuning ALBERT.

3 Likes

Any reason for not preferring a contrastive learning framework like SimCSE (https://arxiv.org/pdf/2104.08821.pdf)?

I’d love to join!