Hey all, I have a fine-tuning question. I’m following this tutorial to fine-tune a sentence transformer: Train and Fine-Tune Sentence Transformers Models.
In the dataset preparation section, they’ve got this example code:
```python
from datasets import load_dataset
from sentence_transformers import InputExample

# Dataset loaded earlier in the tutorial; each entry in 'set' has 'query', 'pos', and 'neg' fields
dataset = load_dataset("embedding-data/QQP_triplets")

train_examples = []
train_data = dataset['train']['set']
# For agility we only use 1/2 of our available data
n_examples = dataset['train'].num_rows // 2
for i in range(n_examples):
    example = train_data[i]
    train_examples.append(InputExample(texts=[example['query'], example['pos'][0], example['neg'][0]]))
```
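For context, the tutorial then feeds `train_examples` into a `DataLoader` and a loss for `model.fit`. Here’s roughly that step as I understand it (my own sketch; the base model and the triplet-style loss are picked for illustration and may differ from the tutorial’s exact choices):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative base model, not necessarily the tutorial's
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)  # each InputExample supplies (anchor, positive, negative)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```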
The tutorial says:

> You can obtain much better results by increasing the number of examples.
I’m wondering whether this refers to the length of `train_examples` or to the length of `texts` when initializing each `InputExample` object.
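In other words, these are the two quantities I mean, continuing from the snippet above:

```python
print(len(train_examples))           # the number of InputExample objects
print(len(train_examples[0].texts))  # the number of texts inside one InputExample (3 here)
```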
My question is: say I have 100 sentences that I’ve declared as similar. Would it be better to have one `InputExample` with `len(texts) == 100`, or 50 `InputExample`s with `len(texts) == 2`?
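To make the two options concrete, here’s a sketch of what I mean, where `sentences` is a hypothetical stand-in for my 100 similar sentences:

```python
from sentence_transformers import InputExample

sentences = [f"similar sentence {i}" for i in range(100)]  # placeholder data

# Option A: one InputExample holding all 100 texts
option_a = [InputExample(texts=sentences)]

# Option B: 50 InputExamples, each holding a pair of texts
option_b = [InputExample(texts=[sentences[i], sentences[i + 1]])
            for i in range(0, 100, 2)]

print(len(option_a), len(option_a[0].texts))  # 1 100
print(len(option_b), len(option_b[0].texts))  # 50 2
```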