I don’t use NLP models much, but I think it’s normal for the output to differ between versions of an AI model released more than a year apart…
In a normal program, even 10-year-old code can still work, but in AI things can change within 6 months.
But if they seem too different, maybe some default parameter has been changed or something. The output can change quite a bit depending not only on the version of the model, but also on the version of the library.
We are sharing this in case it helps anyone understand why the two SetFit versions don’t produce the exact same models.
We were unable to produce the exact same models using SetFit v0.6.0 and SetFit v1.0.3. As part of our investigation, we noticed that there were several factors that led to different fine-tuned models between these SetFit versions:
In our original post I stated that each model was reproducible, but that actually wasn’t the case when we started troubleshooting. Even though we had set a seed in setfit.SetFitTrainer() (v0.6.0) and setfit.TrainingArguments() (v1.0.3), the SetFit model’s head was being initialised with random weights at the start of every training run. This meant that the training script produced a different fine-tuned model after each run. This issue was resolved by adding a transformers.trainer_utils.set_seed() call before calling the SetFitModel.from_pretrained() function.
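In case it helps anyone, here is a minimal sketch of that fix. The base model name and seed value are just placeholders; the important part is that the seed is set before the model (and therefore its randomly initialised head) is created.

```python
from transformers.trainer_utils import set_seed
from setfit import SetFitModel

# Fix the RNG state *before* creating the model, so the randomly
# initialised classification head is identical on every run.
set_seed(42)

# "sentence-transformers/paraphrase-mpnet-base-v2" is only an example base model.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
```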
Having done this, we got back to the state we were in when we posted. The following is the explanation we found for the difference in model outputs.
The SetFit training process also involves creating positive and negative sentence pairs. We noticed that the sampling methods in the two SetFit versions are implemented differently and follow different logic. SetFit v1.0.3 uses the shuffle_combinations() function and the ContrastiveDataset() class in sampler.py to generate and select pairs, whereas SetFit v0.6.0 uses the sentence_pairs_generation() function in modeling.py to generate and select pairs.
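To illustrate why this matters, here is a deliberately simplified toy sketch (not the actual SetFit code, and the function names are made up for illustration) of two different pair-sampling strategies. Even with the same sentences and the same seed, two different sampling procedures will generally select different pairs, so the contrastive training data differs and the fine-tuned models end up different.

```python
import random

sentences = ["s0", "s1", "s2", "s3"]
labels = [0, 0, 1, 1]

def sample_pairs_per_sentence(sentences, labels, seed=42):
    """Toy strategy A: for each sentence, randomly pick one positive and one negative partner."""
    rng = random.Random(seed)
    pairs = []
    for sent, lab in zip(sentences, labels):
        pos = [s for s, l in zip(sentences, labels) if l == lab and s != sent]
        neg = [s for s, l in zip(sentences, labels) if l != lab]
        pairs.append((sent, rng.choice(pos), 1.0))  # positive pair
        pairs.append((sent, rng.choice(neg), 0.0))  # negative pair
    return pairs

def sample_pairs_from_combinations(sentences, labels, seed=42):
    """Toy strategy B: enumerate all unique index combinations in a shuffled order, then label each pair."""
    rng = random.Random(seed)
    indices = [(i, j) for i in range(len(sentences)) for j in range(i + 1, len(sentences))]
    rng.shuffle(indices)
    return [(sentences[i], sentences[j], 1.0 if labels[i] == labels[j] else 0.0)
            for i, j in indices]

# Same inputs and seed, but the two strategies yield different pair sets and orderings,
# which in turn lead to different fine-tuned models.
print(sample_pairs_per_sentence(sentences, labels))
print(sample_pairs_from_combinations(sentences, labels))
```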
There may have been some other factors causing this discrepancy as well.