Why does my model behave differently each time it is loaded?


A BertModel fine-tuned on an NER task behaves differently each time the .bin file is loaded.


When the model finishes training:

  • Input
    I am John and I work at Hugging-Face
  • Output
    [(I, O), (am, O), (John, PER), (and, O), (I, O), (work, O), (at, O), (Hugging-Face, ORG)]

After stopping the notebook session and loading the model:

  • Input
    I am John and I work at Hugging-Face
  • Output
    [(I, PER), (am, PER), (John, PER), (and, PER), (I, ORG), (work, PER), (at, O), (Hugging-Face, PER)]


  • Colab Pro +
  • Transformers == 4.23.1
  • Torch == 1.12.1


I am currently facing an issue with my NER model, which is based on BertModel from the Transformers library and inspired by the BertForTokenClassification code base.

After training and evaluation I end up with a well-performing model, with a validation accuracy above 96%. The problem is that when I save the model and load it for inference, it gives different (and bad) predictions each time it is loaded. Right after training finishes, the predictions are good. But when I stop the notebook session, start a new one, and load my best saved model, it behaves differently.

Model Architecture:

import torch
from torch import nn
from typing import Optional

class NerBertModel(nn.Module):
  def __init__(self, id2label, label2id, num_labels):
    super().__init__()
    self.id2label = id2label
    self.label2id = label2id
    self.num_labels = num_labels
    self.bert = Config.MODEL  # BertModel backbone
    # Fall back to the hidden dropout rate when no classifier dropout is configured
    classifier_dropout = (
        Config.CONFIG.classifier_dropout
        if Config.CONFIG.classifier_dropout is not None
        else Config.CONFIG.hidden_dropout_prob
    )
    self.dropout = nn.Dropout(classifier_dropout)

    self.classifier = nn.Linear(Config.CONFIG.hidden_size, num_labels)

  def forward(self, 
              input_ids: Optional[torch.Tensor] = None, 
              attention_mask: Optional[torch.Tensor] = None, 
              token_type_ids: Optional[torch.Tensor] = None,
              labels: Optional[torch.Tensor] = None):
    outputs = self.bert(input_ids, attention_mask)

    sequence_output = outputs[0]
    sequence_output = self.dropout(sequence_output)
    logits = self.classifier(sequence_output)

    loss = None
    if labels is not None:
      loss_fct = nn.CrossEntropyLoss()
      loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    return loss, logits

Saved the model using:

torch.save(model.state_dict(), Config.MODEL_PATH)

Loaded the model using:

model = NerBertModel(id2label, label2id, num_labels=len(id2label))
model.load_state_dict(torch.load(Config.MODEL_PATH))  # Config.MODEL_PATH is the model.bin file


The same problem also occurs when I use the standard NER model BertForTokenClassification from the Transformers library directly, saving and loading it as follows:

# Save best model

# Load the model

The seed function I am using:

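def seed_torch(seed=42):
    torch.manual_seed(seed)           # seed the CPU random number generator
    torch.cuda.manual_seed_all(seed)  # seed every CUDA device
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True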

Problem Fixed

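I fixed it. The problem was that unique_labels came out in a different order in each notebook session, so the labels were encoded differently each time. The saved classifier weights were trained with one label-to-index mapping, but the predictions were decoded with another. The cause is set(), which I used to collect the unique labels before encoding them dynamically: Python does not guarantee the iteration order of a set, and string hashing is randomized per interpreter session by default. The encoding was built as in the snippet below: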

unique_labels = set([label for labels in data["token_labels"].values for label in labels])
label2id = {k: v for v, k in enumerate(unique_labels)}
id2label = {v: k for v, k in enumerate(unique_labels)}
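To see the effect in isolation, here is a minimal sketch that builds the same set of labels in several fresh interpreters, each started with a different PYTHONHASHSEED. The label list and the helper name set_order are made up for illustration and are not from my notebook. Note that seeding torch inside the notebook does not influence this, because the hash seed is read when the interpreter starts.

```python
import json
import os
import subprocess
import sys

# Hypothetical label set, for illustration only.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# One-liner run in a child interpreter: build a set and print its iteration order.
SNIPPET = "import json, sys; print(json.dumps(list(set(json.loads(sys.argv[1])))))"


def set_order(hash_seed: int) -> list:
    """Return the iteration order of set(LABELS) in a fresh interpreter."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET, json.dumps(LABELS)],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


# Each seed simulates a separate notebook session.
orders = [set_order(seed) for seed in range(1, 6)]
for order in orders:
    print(order)
```

The printed lists contain the same labels, but typically not in the same order, which is exactly what makes enumerate() assign different ids from one session to the next.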

It should be noted that the problem occurred even when:

  1. Using the standard BertForTokenClassification model from the Hugging Face transformers library with save_pretrained() and from_pretrained(). That said, save_pretrained() and from_pretrained() remain the recommended way to save and load the best model when it is based on the Hugging Face transformers library.
  2. Running the notebook on the local host.

So avoid using set() for label encoding, or sort its output first, so that you always end up with the same label encoding.
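A minimal sketch of that fix, assuming the labels are available as a list of per-sentence label lists (a toy stand-in for data["token_labels"].values). The helper names build_label_maps, save_label_maps and load_label_maps are hypothetical. Writing the mapping to a JSON file next to the checkpoint is an extra precaution I am suggesting here, not something the original notebook did.

```python
import json
from pathlib import Path


def build_label_maps(token_labels):
    """Build a label encoding that is identical in every session."""
    # sorted() removes the dependence on set iteration order.
    unique_labels = sorted({label for labels in token_labels for label in labels})
    label2id = {label: idx for idx, label in enumerate(unique_labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


def save_label_maps(label2id, path):
    """Store the mapping so inference can reuse exactly the training encoding."""
    Path(path).write_text(json.dumps(label2id, indent=2))


def load_label_maps(path):
    """Reload the mapping and rebuild the inverse dictionary."""
    label2id = json.loads(Path(path).read_text())
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


# Toy stand-in for data["token_labels"].values
token_labels = [["O", "O", "PER"], ["O", "ORG"], ["PER", "O"]]
label2id, id2label = build_label_maps(token_labels)
print(label2id)  # {'O': 0, 'ORG': 1, 'PER': 2}
```

At inference time, calling load_label_maps() on the saved file instead of rebuilding the mapping from the data keeps the decoding tied to the encoding the weights were trained with, even if the dataset changes later.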