Reproducible models between SetFit versions?

I previously trained a SetFit model using SetFit v0.6.0 (Python 3.8).

I have followed the migration steps and refactored my code to move to SetFit v1.0.3 (Python 3.11).

I used the same hyperparameters, the same random seeds, and the same training data.

Training under v1.0.3 is reproducible: the same model is created each time.
Training under v0.6.0 is also reproducible.

The two models are not the same, and their outputs are significantly different.

Is this expected?
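For reference, here is a hypothetical sketch of the kind of comparison involved (the model paths are placeholders, and it assumes both fine-tuned checkpoints can be loaded under v1.0.3):

```python
# Hypothetical comparison sketch: load the two fine-tuned models and
# compare their predicted probabilities on identical inputs.
# The paths below are placeholders, not our actual artifacts.
from setfit import SetFitModel

texts = ["example sentence one", "example sentence two"]

model_v06 = SetFitModel.from_pretrained("path/to/finetuned-with-v0.6.0")
model_v103 = SetFitModel.from_pretrained("path/to/finetuned-with-v1.0.3")

print(model_v06.predict_proba(texts))
print(model_v103.predict_proba(texts))  # noticeably different probabilities
```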


I don’t work with NLP models much, but I think it’s normal for outputs to differ between versions of an AI model released more than a year apart.
In conventional software, even 10-year-old code can still work, but in AI, behaviour can change within 6 months.

But if the results seem too different, perhaps a default parameter has changed. Output can vary considerably depending not only on the version of the model but also on the version of the library.

@tomaarsen @lewtun do you have any insights?

We are sharing this in case it helps anyone understand why the two SetFit versions don’t produce the exact same models.

We were unable to produce identical models using SetFit v0.6.0 and SetFit v1.0.3. As part of our investigation, we noticed several factors that led to different fine-tuned models between these versions:

In our original post I stated that each model was reproducible, but that was not actually the case when we started troubleshooting. Even though we had set a seed in setfit.SetFitTrainer() (v0.6.0) and setfit.TrainingArguments() (v1.0.3), the SetFit model’s head was being initialised with random weights at the start of every training run, so the training script produced a different fine-tuned model after each run. This was resolved by adding a transformers.trainer_utils.set_seed() call before calling SetFitModel.from_pretrained(), as sketched below.
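A minimal sketch of the fix under the v1.0.3 API (the base model ID and the toy dataset are placeholder assumptions, not our actual setup):

```python
# Minimal sketch of the fix (SetFit v1.0.3 API). The model ID and the
# toy dataset are placeholders, not from our actual script.
from datasets import Dataset
from transformers.trainer_utils import set_seed
from setfit import SetFitModel, Trainer, TrainingArguments

set_seed(42)  # seed BEFORE from_pretrained(), so the randomly initialised
              # classification head gets the same weights on every run

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2"
)

train_dataset = Dataset.from_dict({
    "text": ["great product", "terrible service", "love it", "awful"],
    "label": [1, 0, 1, 0],
})

args = TrainingArguments(seed=42)  # seeds the training loop itself
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```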

Having done this, we got back to the state described in the original post. The following is the cause we found for the remaining difference in model outputs.

The SetFit training process also involves creating positive and negative sentence pairs, and we noticed that the sampling methods of the two versions follow different logic. SetFit v1.0.3 uses the shuffle_combinations() function and the ContrastiveDataset class in sampler.py to generate and select pairs, whereas SetFit v0.6.0 uses the sentence_pairs_generation() function in modeling.py.
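To illustrate why this matters, here is a simplified, self-contained sketch (not the actual SetFit implementations): two pair-sampling strategies, loosely modelled on the two versions, consume the random number generator differently and so produce different training pairs even from the same seed and data.

```python
# Simplified sketch (NOT the actual SetFit code): two pair-sampling
# strategies, given the same seed and data, still produce different
# sentence pairs, and therefore different fine-tuned models.
import random
from itertools import combinations

texts = ["good", "fine", "bad", "poor"]
labels = [1, 1, 0, 0]

def sample_pairs_per_sentence(seed):
    # Loosely like v0.6.0's sentence_pairs_generation(): for each
    # sentence, draw one random positive and one random negative partner.
    rng = random.Random(seed)
    pairs = []
    for i, text in enumerate(texts):
        positives = [texts[j] for j in range(len(texts))
                     if j != i and labels[j] == labels[i]]
        negatives = [texts[j] for j in range(len(texts))
                     if labels[j] != labels[i]]
        pairs.append((text, rng.choice(positives), 1.0))
        pairs.append((text, rng.choice(negatives), 0.0))
    return pairs

def sample_pairs_from_combinations(seed):
    # Loosely like v1.0.3's shuffle_combinations() + ContrastiveDataset:
    # enumerate every pair of sentences, then shuffle the whole set.
    rng = random.Random(seed)
    combos = list(combinations(range(len(texts)), 2))
    rng.shuffle(combos)
    return [(texts[i], texts[j], float(labels[i] == labels[j]))
            for i, j in combos]

# Same seed, same data -- different pair sets, hence different training.
print(sample_pairs_per_sentence(42))
print(sample_pairs_from_combinations(42))
```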

There may have been some other factors causing this discrepancy as well.


Thanks for your info. It works.

