Is there a way to correctly load a pre-trained transformers model without the configuration file?

I would like to fine-tune a pre-trained transformers model on Question Answering. The model was pre-trained on large engineering & science related corpora.

I have been provided a “checkpoint.pt” file containing the weights of the model. They have also provided me with a “bert_config.json” file but I am not sure if this is the correct configuration file.

from transformers import AutoModel, AutoTokenizer, AutoConfig

MODEL_PATH = "./checkpoint.pt"
config = AutoConfig.from_pretrained("./bert_config.json")
model = AutoModel.from_pretrained(MODEL_PATH, config=config)

The reason I believe that bert_config.json doesn’t match “./checkpoint.pt” file is that, when I load the model with the code above, I get the error that goes as below.

Some weights of the model checkpoint at ./aerobert/phase2_ckpt_4302592.pt were not used when initializing BertModel: ['files', 'optimizer', 'model', 'master params']

  • This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ./aerobert/phase2_ckpt_4302592.pt and are newly initialized: ['encoder.layer.2.attention.output.LayerNorm.weight', 'encoder.layer.6.output.LayerNorm.bias', 'encoder.layer.7.intermediate.dense.bias', 'encoder.layer.2.output.LayerNorm.bias', 'encoder.layer.21.attention.self.value.bias', 'encoder.layer.11.attention.self.value.bias', …

If I am correct in assuming that “bert_config.json” is not the correct one, is there a way to load this model correctly without the config.json file?

This error is telling you that the checkpoint they gave you also includes the state of other things: they saved the optimizer state as well, not just the model weights. It seems you only need to load the “model” key. Maybe there is a better way than this, but I think you can do:


import torch
from transformers import AutoConfig, BertModel

MODEL_PATH = "./checkpoint.pt"
state_dict = torch.load(MODEL_PATH)["model"]
config = AutoConfig.from_pretrained("./bert_config.json")
model = BertModel(config)

model = BertModel._load_state_dict_into_model(
    model,
    state_dict,
    MODEL_PATH
)[0]

# make sure token embedding weights are still tied if needed
model.tie_weights()

# Set model in evaluation mode to deactivate DropOut modules by default
model.eval()

I did not test this.
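A rough alternative, in case that private helper changes between transformers versions: a plain-PyTorch sketch along these lines might also work. It assumes the checkpoint keys carry a "bert." prefix, which you should verify against your file.

import torch
from transformers import AutoConfig, BertModel

MODEL_PATH = "./checkpoint.pt"
state_dict = torch.load(MODEL_PATH, map_location="cpu")["model"]

# Assumption: the keys are prefixed with "bert." while a bare BertModel
# expects un-prefixed parameter names; strip the prefix if present.
state_dict = {
    (k[len("bert."):] if k.startswith("bert.") else k): v
    for k, v in state_dict.items()
}

config = AutoConfig.from_pretrained("./bert_config.json")
model = BertModel(config)

# strict=False reports missing/unexpected keys instead of raising
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing)
print("unexpected:", unexpected)

model.tie_weights()
model.eval()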


You are absolutely correct, the checkpoint also includes the state of other things. I hadn’t noticed this! I have checked the keys with the code below:

import torch

MODEL_PATH = "./aerobert/phase2_ckpt_4302592.pt"
keys = torch.load(MODEL_PATH).keys()
keys

Output: dict_keys(['model', 'optimizer', 'master params', 'files'])

If I look at the files, there are quite a few, as below:

[3,
 '/local_workspace_data/bert/part-00879-of-00500.hdf5',
 '/local_workspace_data/bert/part-00562-of-00500.hdf5',
 '/local_workspace_data/bert/part-01703-of-00500.hdf5',
 '/local_workspace_data/bert/part-01706-of-00500.hdf5',
 …]

If I run your code below, it produces an error:

import torch
from transformers import AutoConfig, BertModel

MODEL_PATH = "./checkpoint.pt"
state_dict = torch.load(MODEL_PATH)["model"]
config = AutoConfig.from_pretrained("./bert_config.json")
model = BertModel(config)

model = BertModel._load_state_dict_into_model(
    model,
    state_dict,
    MODEL_PATH
)[0]

The error:

RuntimeError: Error(s) in loading state_dict for BertModel:
size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30528, 1024]) from checkpoint, the shape in current model is torch.Size([30522, 1024]).

Does this mean the vocabulary of the saved model has 6 additional tokens?

The “files” are probably the ones they used for training. You get the paths to those files, but you probably do not have access to them. That should not matter, though, as you won’t need that data anymore.

Yes, that is correct. This information should also be in the config file under a key “vocab_size”. But it might be that they gave you the wrong config, as you state in your first post. Best to ask them for the correct one so that you are sure that the other parameters are correct as well.
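As a quick sanity check, you could also read the embedding shape straight from the checkpoint and compare it with the config. Untested sketch; the key name is taken from the size-mismatch error above and may differ in your file:

import torch
from transformers import AutoConfig

state_dict = torch.load("./checkpoint.pt", map_location="cpu")["model"]
# Key name taken from the size-mismatch error above
embeddings = state_dict["bert.embeddings.word_embeddings.weight"]
print("checkpoint vocab size:", embeddings.shape[0])  # 30528 according to the error

config = AutoConfig.from_pretrained("./bert_config.json")
print("config vocab size:", config.vocab_size)  # 30522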


The configuration file I have been provided shows "vocab_size": 30522, the same as the original BERT. I will try to obtain the correct config file.

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

If vocab size is different, do you think I might also need the trained tokenizer?

I think you should be able to do

model.resize_token_embeddings(30528) 

before you load the state dict. The state dict should then load successfully. However, as you point out, it is likely that they added tokens to the tokenizer, so you should get their tokenizer files as well. Then it would be as simple as:

import torch
from transformers import AutoConfig, BertModel

MODEL_PATH = "./checkpoint.pt"
state_dict = torch.load(MODEL_PATH)["model"]
config = AutoConfig.from_pretrained("./bert_config.json")
tokenizer = <load tokenizer here>
model = BertModel(config)

model.resize_token_embeddings(len(tokenizer)) 

model = BertModel._load_state_dict_into_model(
    model,
    state_dict,
    MODEL_PATH
)[0]

# make sure token embedding weights are still tied if needed
model.tie_weights()

# Set model in evaluation mode to deactivate DropOut modules by default
model.eval()
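
Once the weights load cleanly, it may also be convenient to save everything in the standard layout so later loads are a single from_pretrained call. Sketch only; "./aerobert-hf" is just a hypothetical output directory:

# Save weights + config (and the tokenizer, once you have their files)
model.save_pretrained("./aerobert-hf")
tokenizer.save_pretrained("./aerobert-hf")

# Later, e.g. for fine-tuning on question answering:
# from transformers import BertForQuestionAnswering
# qa_model = BertForQuestionAnswering.from_pretrained("./aerobert-hf")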

I have a related question regarding loading saved checkpoints. Is it possible to load checkpoints from the Trainer with from_pretrained?

I am asking because I am trying to instantiate an EncoderDecoderModel from checkpoints of two language models, as follows, and the resulting EncoderDecoderModel is not behaving as expected: it crashes when calling .generate on sequences longer than the decoder’s maximum sequence length, even though the encoder has a much longer input span and should be able to handle that input length:

encdec_model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "../models/pretrained/enc/checkpoint-540000/", 
        "../models/pretrained/dec/checkpoint-1820000/"
)

config_encoder = encdec_model.config.encoder
config_decoder  = encdec_model.config.decoder
# set decoder config to causal lm
config_decoder.is_decoder=True
config_decoder.add_cross_attention=True
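
For reference, the failing call looks roughly like this (sketch only; long_input_ids stands for a batch of encoder inputs longer than 512 tokens):

# Illustration of the symptom: the encoder allows up to 10240 positions,
# but generation crashes once the input exceeds the decoder's 512 positions.
output_ids = encdec_model.generate(
    input_ids=long_input_ids,  # shape (batch, seq_len), seq_len > 512
    decoder_start_token_id=encdec_model.config.decoder.bos_token_id,
    max_length=256,
)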

encoder config:

{
  "architectures": [
    "BigBirdForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_type": "block_sparse",
  "block_size": 64,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 10240,
  "model_type": "big_bird",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_random_blocks": 8,
  "pad_token_id": 0,
  "rescale_embeddings": false,
  "sep_token_id": 66,
  "transformers_version": "4.8.2",
  "type_vocab_size": 2,
  "use_bias": true,
  "use_cache": true,
  "vocab_size": 32000
}

decoder config:

{
  "architectures": [
    "BigBirdForCausalLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_type": "original_full",
  "block_size": 64,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": true,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "big_bird",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_random_blocks": 3,
  "pad_token_id": 0,
  "rescale_embeddings": false,
  "sep_token_id": 66,
  "transformers_version": "4.8.2",
  "type_vocab_size": 2,
  "use_bias": true,
  "use_cache": true,
  "vocab_size": 32000
}

I think I am not loading the weights into the EncoderDecoderModel correctly, but I am not sure what the correct way is. Do I need to pass the configs as well? I would appreciate your help. Thank you!