Saving Manually Resized Embeddings for a Pretrained BERT Model (I believe I am asking this correctly)

I’m a machine learning newbie, so I’m sorry if this isn’t entirely clear, but I’ll try to be as concise as I can.

I’m fine-tuning a pretrained BERT model, specifically 'bert-base-uncased', and part of this was expanding the model’s embedding matrix so I could add new tokens, namely emojis, that the tokenizer can then tokenize properly. This is how I did it:

import torch
from torch import nn
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=2,
                                                      output_attentions=False,
                                                      output_hidden_states=False)

# Append copies of embedding rows 101:3399 (30522 + 3298 = 33820 rows)
# and swap the enlarged matrix into the model.
weights = model.bert.embeddings.word_embeddings.weight.data
new_weights = torch.cat((weights, weights[101:3399]), 0)
new_emb = nn.Embedding.from_pretrained(new_weights, padding_idx=0, freeze=False)
model.bert.embeddings.word_embeddings = new_emb
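On the tokenizer side, the emojis were registered as new tokens, roughly along these lines (just a sketch; emoji_list stands in for my actual list of 3,298 emoji strings, matching the slice above, and the exact call may have differed):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
emoji_list = ["😊", "🔥"]  # placeholder; in practice this is the full list of emoji strings
tokenizer.add_tokens(emoji_list)  # register them as whole tokens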

This worked: the tokenizer properly tokenizes emojis, and I saved the model so I can load it later for evaluation or further fine-tuning. However, when I try to load the model in a separate evaluation script, I get a tensor size mismatch error. Specifically:

Error(s) in loading state_dict for BertForSequenceClassification:
    size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([33820, 768]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method
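For context, the save at the end of fine-tuning looked roughly like this (a sketch; model_dir and state_dict_dir are placeholder paths, and the exact calls may differ from what I actually ran):

model.save_pretrained(model_dir)                  # config + weights
tokenizer.save_pretrained(model_dir)              # tokenizer files alongside
torch.save(model.state_dict(), state_dict_dir)    # separate raw state_dict checkpoint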

Now, I have a workaround, but I think it’s janky and frankly I don’t like it, because ideally I’d like something scalable: there’s a good chance I’ll have to add more new tokens later, and I dislike the idea of manually resizing the embeddings every time I load the model. I’m unsure of the right way to, essentially, save the model with its current weights so that loading just works. I’ve thought about creating a new model class that inherits from BertForSequenceClassification, just with resized embeddings, but I’m unsure how to accomplish that.
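Roughly, what I have in mind for that subclass is something like the following (untested, and the class name is just a placeholder; I’m not even sure this is the right shape for it):

import torch
from torch import nn
from transformers import BertForSequenceClassification

class EmojiBertForSequenceClassification(BertForSequenceClassification):
    # Same model, but the embedding matrix is expanded in __init__ so the
    # saved 33820-row weights would (hopefully) load without a size mismatch.
    def __init__(self, config):
        super().__init__(config)
        weights = self.bert.embeddings.word_embeddings.weight.data
        new_weights = torch.cat((weights, weights[101:3399]), 0)
        self.bert.embeddings.word_embeddings = nn.Embedding.from_pretrained(
            new_weights, padding_idx=0, freeze=False)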

Here is the workaround:

import torch
from torch import nn
from transformers import BertForSequenceClassification, BertTokenizer

# ignore_mismatched_sizes lets from_pretrained skip the enlarged embedding
# weight (33820 rows) instead of erroring on the default 30522-row layer.
model = BertForSequenceClassification.from_pretrained(model_dir, num_labels=2, ignore_mismatched_sizes=True)

# Redo the same manual resize so the embedding shape matches the checkpoint...
weights = model.bert.embeddings.word_embeddings.weight.data
new_weights = torch.cat((weights, weights[101:3399]), 0)
new_emb = nn.Embedding.from_pretrained(new_weights, padding_idx=0, freeze=False)
model.bert.embeddings.word_embeddings = new_emb

# ...then load the full saved state_dict over the top and grab the tokenizer.
model.load_state_dict(torch.load(state_dict_dir, weights_only=True))
tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=True)
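After that, a quick sanity check shows the embedding matrix has the expanded shape from the checkpoint:

print(model.bert.embeddings.word_embeddings.weight.shape)  # torch.Size([33820, 768])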

What is the right way to accomplish what I’m trying to do? If I’m unclear, please say so and I will do my best to try and clear it up.
