Two questions when I wrapped AutoModelForMaskedLM

I'm trying to wrap AutoModelForMaskedLM and train it. Here is my code:

from transformers import AutoModelForMaskedLM, AutoTokenizer, PreTrainedModel

model_checkpoint = "google-bert/bert-base-chinese"
origin_model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
config = origin_model.config
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

class BertForMask(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        
        self.bert = AutoModelForMaskedLM.from_config(config)
         
    def forward(self, input_ids, attention_mask, token_type_ids, labels):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            labels=labels)
        return outputs

from transformers import Trainer

train_model = BertForMask(config)
trainer = Trainer(
    model=train_model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    processing_class=tokenizer, 
)

eval_results = trainer.evaluate()

As shown above, I wrapped AutoModelForMaskedLM by defining a model called BertForMask, and its structure is the same as AutoModelForMaskedLM. Then I define the trainer and call evaluate().
The first question is that if I change the model parameter in Trainer from train_model to origin_model, the eval_results will be different. I don't understand, because there is no essential difference between the two models.

The second question is that after training train_model, saving it locally, and reloading it, the weights of the reloaded model are different from the trained one in memory. Here is my code:

saved_model_path=r"XXX"
from transformers import AutoConfig, AutoTokenizer
saved_config = AutoConfig.from_pretrained(saved_model_path)
saved_model = BertForMask(config=saved_config)
saved_tokenizer = AutoTokenizer.from_pretrained(saved_model_path)

print(saved_model.bert.bert.encoder.layer[0].attention.self.query.weight)
print(train_model.bert.bert.encoder.layer[0].attention.self.query.weight)

Could anyone please help me?


I think this is because the state_dict is not loaded when using from_config().
The following is a response from Hugging Chat.


Based on your description and the code provided, let me explain the two issues you are facing and provide solutions.


First Issue: Evaluation Results Are Different When Switching Between Models

The evaluation results are different when you switch between train_model (your wrapped model) and origin_model (the original model) because of how the BertForMask class is defined and initialized. Here's why:

  1. Initialization of the Wrapped Model:

    • In your BertForMask class, you are using self.bert = AutoModelForMaskedLM.from_config(config) during initialization. This creates a new model instance with random weights that are not initialized using the pretrained weights from origin_model. While the architecture is the same, the weights are different (see the quick check after this list).
    • The origin_model is initialized with AutoModelForMaskedLM.from_pretrained(model_checkpoint), which loads the pretrained weights from the checkpoint (e.g., "google-bert/bert-base-chinese").
  2. Model Weights:

    • When you pass train_model to the Trainer, it starts from randomly initialized weights (even though the architecture matches). This explains why the evaluation results are different when you switch between train_model and origin_model.
    • The origin_model is a pretrained model, while train_model is a new model that hasn't been trained yet.
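A quick way to confirm this is to compare one weight tensor from each model before any training. This is a minimal sketch; it assumes origin_model and train_model are constructed exactly as in the question:

import torch

# With from_config, the wrapped model's weights are freshly initialized,
# so this comparison is expected to print False until they are copied over.
print(torch.equal(
    origin_model.bert.encoder.layer[0].attention.self.query.weight,
    train_model.bert.bert.encoder.layer[0].attention.self.query.weight,
))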

Solution

To ensure that train_model starts with the same weights as origin_model, you should initialize the wrapped model with the pretrained weights. You can do this by copying the weights from origin_model to train_model during initialization. Modify your BertForMask class as follows:

class BertForMask(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = AutoModelForMaskedLM.from_config(config)  # Initialize the model structure
        # Copy weights from the origin_model to train_model
        self.bert.load_state_dict(origin_model.state_dict())

    def forward(self, input_ids, attention_mask, token_type_ids, labels):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            labels=labels,
        )
        return outputs

Second Issue: Saved Model Weights Are Different from Trained Model in Memory

The discrepancy between the weights of the saved model and the trained model in memory is due to how the model is being saved and loaded. Here's why:

  1. Saving the Model:

    • In your code, you are saving the model using a custom approach, but you are not explicitly saving the model's state. You need to use the save_pretrained() method provided by the Transformers library to ensure that the weights are saved correctly.
  2. Loading the Model:

    • When you load the model using BertForMask(config=saved_config), you are initializing a new model with random weights instead of loading the saved weights. Similar to the first issue, the weights are not being properly restored from the saved checkpoint.

Solution

To correctly save and load the model, follow these steps:

  1. Save the Trained Model:

    train_model.save_pretrained(saved_model_path)
    tokenizer.save_pretrained(saved_model_path)
    
  2. Load the Saved Model:

    saved_model = BertForMask.from_pretrained(saved_model_path)
    saved_tokenizer = AutoTokenizer.from_pretrained(saved_model_path)
    

    Modify your BertForMask class to support the from_pretrained() method:

    class BertForMask(PreTrainedModel):
        def __init__(self, config):
            super().__init__(config)
            self.bert = AutoModelForMaskedLM.from_config(config)
    
        def forward(self, input_ids, attention_mask, token_type_ids, labels):
            outputs = self.bert(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                labels=labels,
            )
            return outputs
    
        @classmethod
        def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
            config = AutoConfig.from_pretrained(pretrained_model_name_or_path)
            model = cls(config, *model_args, **kwargs)
            state_dict = kwargs.pop("state_dict", None)
            model.load_state_dict(state_dict)
            return model
    

Final Notes

  • The wrapped model (BertForMask) should be properly initialized with the pretrained weights using from_pretrained().
  • When saving and loading the model, use the save_pretrained() and from_pretrained() methods provided by the Transformers library to ensure that the weights are saved and loaded correctly.

By making these changes, the evaluation results between the two models should be consistent, and the saved model weights should match the trained model in memory.

Based on the information provided and the sources referenced, here are the best practices to make the problems you're experiencing less likely to occur:


Best Practices

1. Proper Initialization of Pretrained Models

  • When wrapping a pretrained model (e.g., BertForMask), ensure it is initialized with the same weights as the original model. This avoids discrepancies between the two models.
  • In your case, initialize the wrapped model with the pretrained weights using load_state_dict() [1][2].

2. Correctly Saving and Loading Models

  • Use the save_pretrained() and from_pretrained() methods provided by the Hugging Face Transformers library. These methods ensure that the model's weights, configuration, and tokenizer are saved and loaded correctly [1][3].
  • After training, call train_model.save_pretrained(saved_model_path) to save the model.
  • When reloading, use BertForMask.from_pretrained(saved_model_path) to load the saved weights into your custom model.

3. Fine-Tuning and Model Training

  • When fine-tuning transformer models, carefully prepare your dataset. Ensure that the data is tokenized and formatted correctly for the chosen model (e.g., using a DataCollator) [4].
  • Use warm-up steps and appropriate learning rate scheduling to stabilize training (see the sketch after this list) [4].
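As an illustration of these two points for the masked-LM setup in this thread, here is a minimal sketch; the concrete values (mlm_probability, warmup_steps, batch sizes, output_dir) are placeholder assumptions, not recommendations:

from transformers import DataCollatorForLanguageModeling, TrainingArguments

# DataCollatorForLanguageModeling randomly masks tokens for MLM training/evaluation.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Warm-up plus a learning-rate schedule to stabilize early training.
training_args = TrainingArguments(
    output_dir="bert-base-chinese-finetuned",
    eval_strategy="epoch",
    learning_rate=2e-5,
    warmup_steps=500,
    lr_scheduler_type="linear",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
)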

4. Model Evaluation and Debugging

  • After training, evaluate the model to ensure it meets performance requirements [4].
  • Use logging and debugging tools to monitor training and identify issues early [3].

5. Robust Model Governance

  • Implement robust error handling and debugging mechanisms to ensure smooth model development [3].

Modified Code Incorporating Best Practices

model_checkpoint = "google-bert/bert-base-chinese"
origin_model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
config = origin_model.config
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

class BertForMask(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = AutoModelForMaskedLM.from_config(config)
        # Initialize the model with the same weights as origin_model
        self.bert.load_state_dict(origin_model.state_dict())

    def forward(self, input_ids, attention_mask, token_type_ids, labels):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            labels=labels,
        )
        return outputs

from transformers import Trainer

# Ensure the model is properly initialized
train_model = BertForMask(config)
train_model.save_pretrained(saved_model_path)  # Save the model before training

# Define the trainer
trainer = Trainer(
    model=train_model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    processing_class=tokenizer,
)

# Train and evaluate
trainer.train()
eval_results = trainer.evaluate()

# Save the trained model
train_model.save_pretrained(saved_model_path)

# Reload the model
saved_model = BertForMask.from_pretrained(saved_model_path)
saved_model_tokenizer = AutoTokenizer.from_pretrained(saved_model_path)

# Verify weights
print(saved_model.bert.bert.encoder.layer[0].attention.self.query.weight)
print(train_model.bert.bert.encoder.layer[0].attention.self.query.weight)

By following these best practices, you can:

  1. Ensure consistent initialization between train_model and origin_model.
  2. Correctly save and load the model to maintain the same weights in memory and on disk.
  3. Improve model training and evaluation reliability [1][3][4].

The first problem is solved.
But it seems there is a bug in the solution to the second problem: in the from_pretrained method, state_dict comes from kwargs, but in BertForMask.from_pretrained(saved_model_path) we didn't pass any kwargs, so state_dict will always be None. Is this a fault?
By the way, there is one thing I don't understand. The model should be saved automatically during trainer.train(), so why should I save it manually?


Oh… buggy. :sweat_smile: There is no need to save manually. Just assign the path of the checkpoint that was automatically saved by Trainer to saved_model_path (see the sketch below). Anyway, the root of the second issue is that you are using from_config, so it is important to use from_pretrained to load the actual weights.
Also, I personally think there is a high possibility that from_pretrained will work even without redefining it.

saved_config = AutoConfig.from_pretrained(saved_model_path)
saved_model = BertForMask(config=saved_config)

to

saved_model = BertForMask.from_pretrained(saved_model_path, trust_remote_code=True)  # trust_remote_code=True is only needed for models uploaded to the Hub with custom code; it is not required for loading local weights.
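For reference, a minimal sketch of locating that automatically saved checkpoint; it assumes the default save strategy, so checkpoints land under training_args.output_dir as checkpoint-<step> folders:

import os

# Pick the most recent checkpoint directory written by the Trainer.
checkpoints = [d for d in os.listdir(training_args.output_dir) if d.startswith("checkpoint-")]
latest = max(checkpoints, key=lambda d: int(d.split("-")[-1]))
saved_model_path = os.path.join(training_args.output_dir, latest)
print(saved_model_path)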

Thank you for your help. I understand the reason for my mistakes, and the weights are the same now.
But there is one problem that confuses me: when calling trainer.evaluate() many times, if I construct the trainer with the saved model, the results stay the same every time, but if I construct the trainer with train_model, the results change. Why?


but if I construct the trainer with train_model, the results change. Why?

I think it's because the dummy weights generated when doing from_config are random rather than a consistent zero-fill. (Probably.)

I don't know why random numbers (pseudo-random numbers) are used…
I guess it's fine as it is…

Maybe it's not from from_config. In fact, I have modified my model as follows:

from transformers import PretrainedConfig, PreTrainedModel

class BertForMask(PreTrainedModel):
    config_class = PretrainedConfig
    def __init__(self, config):
        super().__init__(config)
        self.bert = origin_model
         
    def forward(self, input_ids, attention_mask, token_type_ids, labels):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            labels=labels)
        return outputs

Then

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-4",
    overwrite_output_dir=True,
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=False,
    fp16=True,
    logging_steps=logging_steps,
    save_safetensors=False
)

import math

from transformers import Trainer

train_model = BertForMask(config)
trainer = Trainer(
    model=train_model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    processing_class=tokenizer, 
)

for i in range(5):
    eval_results = trainer.evaluate()
    print(eval_results)
    print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

the results will be different.


OK, I found the reason.
After I reload my local model, I initialize a new Trainer before every call to trainer.evaluate(), so the results are the same each time.
But with train_model, I initialize the Trainer once and call trainer.evaluate() many times, so the results change. (Presumably because the Trainer sets the random seed when it is initialized, so a fresh Trainer reproduces the same random masking from the data collator; see the sketch below.)
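A minimal sketch of that explanation; it assumes the data collator is DataCollatorForLanguageModeling (which masks tokens at random) and that Trainer seeds the RNG once in its constructor:

from transformers import set_seed

# Re-seeding before each call reproduces the same random masking,
# so repeated evaluations of the same model give the same loss.
for i in range(5):
    set_seed(training_args.seed)
    eval_results = trainer.evaluate()
    print(eval_results["eval_loss"])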
