Background
The quality of sentence embedding models can easily be improved with:
- Larger, more diverse training data
- Larger batch sizes
However, training on large datasets with large batch sizes requires a lot of GPU / TPU memory.
A TPU v3-8 offers a massive 128 GB of memory, enabling the training of outstanding sentence embedding models.
Join me and use this event to train the best sentence embedding models that ever existed.
Roadmap
- Create a JAX training script for MultipleNegativesRankingLoss, which is currently the best method to train sentence embeddings (a loss sketch follows this list). As training data, we need text pairs (textA, textB) where textA and textB should be close in vector space. These can be anything, such as (question, answer), (text, summary), (paper, related_paper), or (input, response).
- Collect suitable training data:
- I already have 25 suitable training datasets that provide 100+ million training pairs (some are listed here).
- Mine (title, question, best_answer) triplets from the Stack Exchange archive
- Mine conversational datasets from Reddit: PolyAI has a ready-to-use script
- Extract Wikipedia intro sections for articles that are in the same category
- Do you have further ideas for suitable (large-scale) datasets?
- The data collection should give us a training dataset of 1+ billion pairs
- Train on this massive corpus and create the best sentence embedding model that ever existed
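To make the first roadmap item concrete, here is a minimal sketch of MultipleNegativesRankingLoss (an InfoNCE-style in-batch negatives loss) in JAX. The function name, argument layout, and the scale factor of 20 are assumptions for illustration; `emb_a` and `emb_b` are the pooled embeddings of textA and textB for one batch.

```python
import jax
import jax.numpy as jnp


def multiple_negatives_ranking_loss(emb_a, emb_b, scale=20.0):
    """In-batch negatives: for row i, column i is the positive pair,
    every other column in the batch acts as a negative. (Sketch; names
    and the scale value are assumptions.)"""
    # L2-normalize so that the dot product equals the cosine similarity
    emb_a = emb_a / jnp.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / jnp.linalg.norm(emb_b, axis=1, keepdims=True)

    # (batch, batch) cosine similarity matrix, scaled by a temperature factor
    scores = scale * (emb_a @ emb_b.T)

    # Cross entropy with the diagonal entries as the correct classes
    log_probs = jax.nn.log_softmax(scores, axis=1)
    return -jnp.mean(jnp.diag(log_probs))
```

Because every other pair in the batch serves as a negative, larger batch sizes directly give more negatives per example, which is why the TPU memory matters so much for this loss.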
Language
The initial training will focus on English. After the event, Multi-Lingual Knowledge Distillation will be used to transfer the model to 50+ languages.
If you have large-scale training data for other languages, feel free to provide it, and we can try to train a multilingual model too.
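For reference, the core of Multi-Lingual Knowledge Distillation is a simple MSE objective on parallel sentences: the student model is trained so that both the English sentence and its translation map onto the teacher's English embedding. A minimal sketch in JAX, with assumed variable names:

```python
import jax.numpy as jnp


def distillation_loss(teacher_emb_en, student_emb_en, student_emb_other):
    # The student should reproduce the teacher's English embedding for both
    # the English sentence and its translation, aligning the vector spaces
    # across languages. (Sketch; variable names are assumptions.)
    mse_en = jnp.mean((student_emb_en - teacher_emb_en) ** 2)
    mse_other = jnp.mean((student_emb_other - teacher_emb_en) ** 2)
    return mse_en + mse_other
```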
Output
We will train different models:
- General purpose model
- Dedicated model for Semantic Search / Question-Answer-Retrieval
- Dedicated model for Conversational AI
Models and datasets will be shared with the community.
Model to Use
We will use RoBERTa and possibly several smaller models (Distil* models, MiniLM, etc.).
Training Script
The training script is not yet available, but creating it is not too difficult: we need the token embeddings from the model + mean pooling + cosine similarity + CrossEntropyLoss.
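To illustrate, the mean-pooling step could look like the sketch below in JAX (shapes and names are assumptions); its output would then feed into the cosine-similarity + cross-entropy loss shown above.

```python
import jax.numpy as jnp


def mean_pooling(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, dim), attention_mask: (batch, seq_len)
    # Padding tokens are excluded from the average via the attention mask.
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = jnp.sum(token_embeddings * mask, axis=1)
    counts = jnp.maximum(jnp.sum(mask, axis=1), 1e-9)  # avoid division by zero
    return summed / counts
```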
You want to join?
Help is needed on several aspects of the project:
- Data crawling & preparation (more data is always better)
- Creating a suitable JAX training script for the InfoNCE loss (I have one for PyTorch)
- Creating the data-loading code so that we can train on 1B+ training pairs (see the sketch after this list)
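As a starting point for the data-loading item, here is a rough sketch that streams pairs lazily from gzipped JSONL files, so nothing has to fit into RAM. The file layout, the {"textA": ..., "textB": ...} record format, and the random sampling across datasets are assumptions:

```python
import gzip
import json
import random


def stream_pairs(jsonl_gz_files, batch_size=256, seed=42):
    """Lazily yield batches of (textA, textB) pairs from gzipped JSONL files.
    (Sketch; file format and sampling strategy are assumptions.)"""
    rng = random.Random(seed)
    readers = [gzip.open(path, "rt") for path in jsonl_gz_files]
    batch = []
    while readers:
        reader = rng.choice(readers)   # pick one dataset at random per example
        line = reader.readline()
        if not line:                   # dataset exhausted -> drop it
            reader.close()
            readers.remove(reader)
            continue
        pair = json.loads(line)        # assumed record: {"textA": ..., "textB": ...}
        batch.append((pair["textA"], pair["textB"]))
        if len(batch) == batch_size:
            yield batch
            batch = []
```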
Interested in joining? Please send me an email to
nils@huggingface.co
so that I can invite you to the kick-off event.
We also have our own Discord server for communication:
Data crawling, preparation, and code writing must happen before we get the TPU compute (7.07. - 14.07.).