Non-determinism when loading a model

hi all,

I want to do hyperparameter tuning and reload my model in a loop. I have realized that if I load the model repeatedly, as shown below, the second load does not give me the same model: the weights are initialized differently. However, across executions the first load always gives the same model, and the subsequent loads are also consistent with each other, but the first one is always != the second one, and so on.
I'm going crazy. What is going on here?

from transformers import BertForSequenceClassification

# MODEL, label2id and id2label are defined earlier in the script
model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=len(label2id), id2label=id2label,
                                                      label2id=label2id, output_attentions=False,
                                                      output_hidden_states=False)
model.save_pretrained('./model1/')

model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=len(label2id), id2label=id2label,
                                                      label2id=label2id, output_attentions=False,
                                                      output_hidden_states=False)
model.save_pretrained('./model2/')

If I run this, the weights are not the same.

import torch
from transformers import BertForSequenceClassification

model1 = BertForSequenceClassification.from_pretrained('./model1/')
model2 = BertForSequenceClassification.from_pretrained('./model2/')

# Compare every parameter tensor; the for/else only prints "same" if the loop never breaks
for p1, p2 in zip(model1.parameters(), model2.parameters()):
    if not torch.allclose(p1, p2):
        print("Weights are not the same.")
        break
else:
    print("Weights are the same.")

I’ve set every imaginable thing to be deterministic:

import os
import random

import numpy as np
import torch

# seed_val is set earlier in the script
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

any help is appreciated!

This post is kind of old, but since I was asking myself the same question, I thought replying now might help others later.

I believe the reason the weights are different after subsequent initializations is that the ‘transformers’ implementation follows the same fine-tuning procedure as hinted at in the original BERT paper: the same pre-trained checkpoint is used, but the classifier layer is initialized anew each time.
Thus, here you were probably loading BertForSequenceClassification from a checkpoint trained for masked language modeling (e.g. bert-base-cased), so only the weights corresponding to the BertModel part of the model were loaded, and the classification head was randomly initialized. (You probably got a warning telling you about this while loading the model.)
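As a quick check (a minimal sketch, not from the original post, assuming the two directories saved above), you can compare the encoder and classifier parameters separately: the ‘bert’ encoder weights should match between the two saves, while the ‘classifier’ weights should not.

import torch
from transformers import BertForSequenceClassification

model1 = BertForSequenceClassification.from_pretrained('./model1/')
model2 = BertForSequenceClassification.from_pretrained('./model2/')

# Encoder weights come from the pre-trained checkpoint, so they should match
bert_same = all(torch.allclose(p1, p2)
                for p1, p2 in zip(model1.bert.parameters(), model2.bert.parameters()))

# The classification head is newly initialized on every load, so it should differ
head_same = all(torch.allclose(p1, p2)
                for p1, p2 in zip(model1.classifier.parameters(), model2.classifier.parameters()))

print("Encoder weights identical:   ", bert_same)   # expected: True
print("Classifier weights identical:", head_same)   # expected: False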

Regarding the fact that ‘in each execution the first one is always the same model and the subsequent ones are also the same’, my best guess is that the classification head is initialized from the global random number generator, whose state advances with each call. With a fixed seed, the first load is therefore always reproducible, the second load is always reproducible, and so on, but they differ from each other (I did not check the source, though).
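If that is right, a simple workaround for the hyperparameter loop (again just a sketch; ‘bert-base-cased’ and num_labels=2 are placeholders for your MODEL and label set) would be to reset the seed right before each load so the classifier initialization draws the same random numbers every time:

import torch
from transformers import BertForSequenceClassification

seed_val = 42  # whatever seed you already use

torch.manual_seed(seed_val)   # reset the RNG state before the first load
model_a = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)

torch.manual_seed(seed_val)   # reset it again before the second load
model_b = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)

# Now every parameter, including the freshly initialized classifier, should match
print(all(torch.allclose(p1, p2)
          for p1, p2 in zip(model_a.parameters(), model_b.parameters())))  # expected: True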

Hope this helps!