Original and re-loaded model are not the same

Hi. I fine-tuned the Wav2Vec2ForCTC model using Common Voice’s Greek data, using the code below:

from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args_bl3 = TrainingArguments(
    output_dir = 'bl2-cv-HFTrainer-small-lr',
    group_by_length = True,
    per_device_train_batch_size = 16, # batch size
    gradient_accumulation_steps = 2,
    evaluation_strategy = 'steps',
    num_train_epochs = 60,
    fp16 = True,
    save_steps = 298,
    eval_steps = 298,
    logging_steps = 298,
    learning_rate = 3e-4,
    warmup_steps = 180,
    save_total_limit = 1,
    load_best_model_at_end = True,
    metric_for_best_model = 'wer',
    greater_is_better = False)

trainer_bl3 = Trainer(
    model = bl2,
    data_collator = data_collator,
    args = training_args_bl3,
    compute_metrics = compute_metrics,
    train_dataset = cv_train,
    eval_dataset = cv_val,
    tokenizer = processor.feature_extractor,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 10)])

trainer_bl3.train()

where the model bl2 is loaded like this:

from transformers import Wav2Vec2ForCTC

bl2 = Wav2Vec2ForCTC.from_pretrained(
    'facebook/wav2vec2-large-xlsr-53',
    attention_dropout = 0.1,
    hidden_dropout = 0.1,
    feat_proj_dropout = 0.0,
    mask_time_prob = 0.05,
    layerdrop = 0.1,
    gradient_checkpointing = True,
    ctc_loss_reduction = 'mean',
    pad_token_id = processor.tokenizer.pad_token_id,
    vocab_size = len(processor.tokenizer),
    cache_dir = '/mnt/twohdd/.cache')

After the training is completed, I proceed to save the model like this:

trainer_bl3.save_model('FILE_NAME')

and then load it as follows:

test = Wav2Vec2ForCTC.from_pretrained('FILE_NAME').to('cuda')

After this, I noticed that the two models (bl2, the fine-tuned one, and test, the reloaded one) yielded different error rates on my test set. When I checked, I saw that all of their parameters differ. I compared them using the function posted by aasharma90 in the PyTorch Forums thread "Check if models have same weights".
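For reference, a comparison of this kind can be sketched as below. This is not the function from the linked thread, only a minimal illustration of the idea; the helper name compare_state_dicts and the toy nn.Linear models are made up for the example and stand in for bl2 and test.

```python
from typing import List

import torch
from torch import nn


def compare_state_dicts(model_a: nn.Module, model_b: nn.Module) -> List[str]:
    """Return the sorted names of state_dict entries that differ.

    Hypothetical helper for illustration: an entry counts as different
    if it exists in only one model, or if dtype, shape, or values differ.
    """
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    # Entries present in only one of the two models.
    mismatched = set(state_a) ^ set(state_b)
    for name in state_a.keys() & state_b.keys():
        # Move both tensors to the CPU so models on different devices compare.
        a, b = state_a[name].cpu(), state_b[name].cpu()
        if a.dtype != b.dtype or a.shape != b.shape or not torch.equal(a, b):
            mismatched.add(name)
    return sorted(mismatched)


# Toy stand-ins for the fine-tuned and reloaded models.
torch.manual_seed(0)
reference = nn.Linear(4, 2)
copy = nn.Linear(4, 2)
copy.load_state_dict(reference.state_dict())
same = compare_state_dicts(reference, copy)

# Perturb one parameter so that exactly one entry differs.
with torch.no_grad():
    copy.bias.add_(1.0)
changed = compare_state_dicts(reference, copy)
print(same, changed)
```

With the real models, the call would presumably be compare_state_dicts(bl2, test); an empty list means every entry matches exactly.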

Am I doing something wrong? Can someone please help me? It is very important to me to be able to load the exact same fine-tuned model in the future.

Thanks in advance.

EDIT: Just wanted to mention that their scores are very close, but not the same. Also, I’d like to note that when making the test set predictions, I set both models to evaluation mode.
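To show what I mean by evaluation mode, here is a small sketch on a toy model (an assumption for illustration, not my actual Wav2Vec2 setup). Calling eval() disables dropout, so two forward passes on the same input should give identical outputs, while in training mode they would usually differ.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy network with dropout, standing in for a model that uses dropout.
toy = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(4, 8)

# Training mode: dropout draws a new random mask on every forward pass.
toy.train()
with torch.no_grad():
    train_first = toy(x)
    train_second = toy(x)

# Evaluation mode: dropout is a no-op, so the forward pass is repeatable.
toy.eval()
with torch.no_grad():
    eval_first = toy(x)
    eval_second = toy(x)

eval_outputs_match = torch.equal(eval_first, eval_second)
print("eval outputs identical:", eval_outputs_match)
print("train outputs identical:", torch.equal(train_first, train_second))
```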

EDIT2: I just tried saving my model with bl2.save_pretrained('FILENAME') instead of using Trainer()'s save_model(). After loading it with the same code shown above (from_pretrained()) and comparing the models using the function I linked above, they match.

This is good in the sense that the models now match; however, I don’t understand why this is happening. As far as I know, Trainer()'s save_model() uses the save_pretrained() method. What’s going wrong here?