I think this is because the state_dict is not loaded when using from_config().
The following is a response from Hugging Chat.
Based on your description and the code provided, let me explain the two issues you are facing and provide solutions.
First Issue: Evaluation Results Are Different When Switching Between Models
The evaluation results differ when you switch between train_model (your wrapped model) and origin_model (the original model) because of how the BertForMask class is defined and initialized. Here's why:
- Initialization of the wrapped model: In your BertForMask class, you call self.bert = AutoModelForMaskedLM.from_config(config) during initialization. This creates a new model instance whose weights are randomly initialized rather than loaded from the pretrained checkpoint; the architecture matches, but the weights do not. By contrast, origin_model is created with AutoModelForMaskedLM.from_pretrained(model_checkpoint), which loads the pretrained weights from the checkpoint (e.g., "google-bert/bert-base-chinese").
- Model weights: When you pass train_model to the Trainer, it starts from randomly initialized weights even though the architecture matches, which is why the evaluation results differ between train_model and origin_model. origin_model is a pretrained model, while train_model is a new model that has not been trained yet.
Solution
To ensure that train_model starts with the same weights as origin_model, initialize the wrapped model with the pretrained weights. You can do this by copying the weights from origin_model into train_model during initialization. Modify your BertForMask class as follows:
class BertForMask(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # Build the model structure from the config (weights are random at this point)
        self.bert = AutoModelForMaskedLM.from_config(config)
        # Copy the pretrained weights from origin_model (defined at module level)
        self.bert.load_state_dict(origin_model.state_dict())

    def forward(self, input_ids, attention_mask, token_type_ids, labels):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            labels=labels,
        )
        return outputs
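As a quick sanity check (a minimal sketch, assuming origin_model and config are defined as in your script), you can confirm that the wrapped model now starts from the pretrained weights:

import torch

# Build the wrapper and compare its parameters with origin_model's
check_model = BertForMask(config)
weights_match = all(
    torch.equal(p, q)
    for p, q in zip(origin_model.state_dict().values(),
                    check_model.bert.state_dict().values())
)
print(weights_match)  # expected: True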
Second Issue: Saved Model Weights Are Different from Trained Model in Memory
The discrepancy between the weights of the saved model and the trained model in memory is due to how the model is being saved and loaded. Here's why:
- Saving the model: In your code, you save the model with a custom approach, but you never explicitly save the model's state. Use the save_pretrained() method provided by the Transformers library so that the weights are written to disk correctly.
- Loading the model: When you load the model with BertForMask(config=saved_config), you are initializing a new model with random weights instead of loading the saved weights. As in the first issue, the weights are not restored from the saved checkpoint.
Solution
To correctly save and load the model, follow these steps:
- Save the trained model:
train_model.save_pretrained(saved_model_path)
tokenizer.save_pretrained(saved_model_path)
- Load the saved model:
saved_model = BertForMask.from_pretrained(saved_model_path)
saved_tokenizer = AutoTokenizer.from_pretrained(saved_model_path)
Modify your BertForMask class to support the from_pretrained() method:
import os
import torch
from safetensors.torch import load_file

class BertForMask(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = AutoModelForMaskedLM.from_config(config)

    def forward(self, input_ids, attention_mask, token_type_ids, labels):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            labels=labels,
        )
        return outputs

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
        config = AutoConfig.from_pretrained(pretrained_model_name_or_path)
        # Pop state_dict before building the model so it is not forwarded to __init__
        state_dict = kwargs.pop("state_dict", None)
        model = cls(config, *model_args, **kwargs)
        if state_dict is None:
            # save_pretrained() writes model.safetensors in recent Transformers
            # versions and pytorch_model.bin in older ones; handle both.
            safetensors_path = os.path.join(pretrained_model_name_or_path, "model.safetensors")
            if os.path.exists(safetensors_path):
                state_dict = load_file(safetensors_path)
            else:
                state_dict = torch.load(os.path.join(pretrained_model_name_or_path, "pytorch_model.bin"), map_location="cpu")
        model.load_state_dict(state_dict)
        return model
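To confirm the round trip, here is a minimal sketch (assuming saved_model_path is a directory you have already written with save_pretrained(), as in the step above):

import torch

# The reloaded weights should be identical to the trained weights in memory
reloaded = BertForMask.from_pretrained(saved_model_path)
print(torch.equal(
    reloaded.bert.bert.encoder.layer[0].attention.self.query.weight,
    train_model.bert.bert.encoder.layer[0].attention.self.query.weight,
))  # expected: True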
Final Notes
- The wrapped model (BertForMask) should be properly initialized with the pretrained weights and restored using from_pretrained() when reloaded.
- When saving and loading the model, use the save_pretrained() and from_pretrained() methods provided by the Transformers library to ensure that the weights are saved and loaded correctly.
By making these changes, the evaluation results between the two models should be consistent, and the saved model weights should match the trained model in memory.
Based on the information provided and the sources referenced, here are best practices to make the problems you're experiencing less likely to occur:
Best Practices
1. Proper Initialization of Pretrained Models
- When wrapping a pretrained model (e.g., BertForMask), ensure it is initialized with the same weights as the original model. This avoids discrepancies between the two models.
- In your case, initialize the wrapped model with the pretrained weights using load_state_dict() [1][2].
2. Correctly Saving and Loading Models
- Use the save_pretrained() and from_pretrained() methods provided by the Hugging Face Transformers library. These methods ensure that the model's weights, configuration, and tokenizer are saved and loaded correctly [1][3].
- After training, call train_model.save_pretrained(saved_model_path) to save the model.
- When reloading, use BertForMask.from_pretrained(saved_model_path) to load the saved weights into your custom model.
3. Fine-Tuning and Model Training
- When fine-tuning transformer models, carefully prepare your dataset. Ensure that the data is tokenized and formatted correctly for the chosen model (e.g., using a DataCollator, as sketched below) [4].
- Use warm-up steps and appropriate learning rate scheduling to stabilize training [4].
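For example, here is a minimal sketch of a masked-language-modeling collator plus warm-up and learning-rate scheduling (the hyperparameter values and output directory are placeholders, not taken from your script):

from transformers import DataCollatorForLanguageModeling, TrainingArguments

# Dynamically mask 15% of tokens in each batch for masked-language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Warm-up steps and a linear learning-rate schedule help stabilize training
training_args = TrainingArguments(
    output_dir="bert-mask-finetuned",  # placeholder output directory
    learning_rate=2e-5,                # placeholder values; tune for your task
    warmup_steps=500,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)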
4. Model Evaluation and Debugging
- After training, evaluate the model to ensure it meets performance requirements [4].
- Use logging and debugging tools to monitor training and identify issues early [3].
5. Robust Model Governance
- Implement robust error handling and debugging mechanisms to ensure smooth model development [3].
Modified Code Incorporating Best Practices
model_checkpoint = "google-bert/bert-base-chinese"
origin_model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
config = origin_model.config
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
class BertForMask(PreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.bert = AutoModelForMaskedLM.from_config(config)
# Initialize the model with the same weights as origin_model
self.bert.load_state_dict(origin_model.state_dict())
def forward(self, input_ids, attention_mask, token_type_ids, labels):
outputs = self.bert(
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
labels=labels,
)
return outputs
from transformers import Trainer
# Ensure the model is properly initialized
train_model = BertForMask(config)
train_model.save_pretrained(saved_model_path) # Save the model before training
# Define the trainer
trainer = Trainer(
model=train_model,
args=training_args,
train_dataset=downsampled_dataset["train"],
eval_dataset=downsampled_dataset["test"],
data_collator=data_collator,
data_collator=data_collator,
)
# Train and evaluate
trainer.train()
eval_results = trainer.evaluate()
# Save the trained model
train_model.save_pretrained(saved_model_path)
# Reload the model
saved_model = BertForMask.from_pretrained(saved_model_path)
saved_model_tokenizer = AutoTokenizer.from_pretrained(saved_model_path)
# Verify weights
print(saved_model.bert.bert.encoder.layer[0].attention.self.query.weight)
print(train_model.bert.bert.encoder.layer[0].attention.self.query.weight)
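As an optional follow-up check (a sketch, assuming trainer.evaluate() returned an eval_loss entry, which it does when the collator supplies labels), you can report the masked-LM perplexity:

import math

# Perplexity is the exponential of the average evaluation loss
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")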
By following these best practices, you can:
- Ensure consistent initialization between train_model and origin_model.
- Correctly save and load the model to maintain the same weights in memory and on disk.
- Improve model training and evaluation reliability [1][3][4].