I’m a machine learning newbie, so I’m sorry if this isn’t entirely clear, but I’ll try to be as concise as I can.
I’m fine-tuning a pretrained BERT model, specifically 'bert-base-uncased'. Part of this involved resizing the model’s embedding matrix so I could add new tokens, namely emojis, and have the tokenizer tokenize them properly. I did it like so:
import torch
import torch.nn as nn
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                       num_labels=2,
                                                       output_attentions=False,
                                                       output_hidden_states=False)

# Extend the embedding matrix, reusing a slice of the existing rows
# as initial weights for the new (emoji) tokens.
weights = model.bert.embeddings.word_embeddings.weight.data
new_weights = torch.cat((weights, weights[101:3399]), 0)
new_emb = nn.Embedding.from_pretrained(new_weights, padding_idx=0, freeze=False)
model.bert.embeddings.word_embeddings = new_emb
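For completeness, the tokenizer side was handled roughly like this (the emoji list below is just a placeholder, not my actual list):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# Placeholder list; the real one has 3298 emojis to match the 3298 new embedding rows above.
new_tokens = ['😀', '😂', '🔥']
tokenizer.add_tokens(new_tokens)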
This worked: emojis were tokenized properly, and I saved the model so I could load it later for evaluation or further fine-tuning. However, when I try to load the model in a separate evaluation script, I get a tensor size mismatch error. Specifically:
Error(s) in loading state_dict for BertForSequenceClassification:
size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([33820, 768]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method
Now, I have a workaround, but it feels janky and frankly I don’t like it, because ideally I’d like something scalable. There’s a good chance I’ll have to add more new tokens later, and I dislike the idea of having to manually resize the embeddings every time I load the model. I’m unsure what the right way is to, essentially, save the model and its current weights. I’ve thought about creating a new model class that inherits from BertForSequenceClassification, just with resized embeddings, but I’m unsure how to accomplish that.
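Something along these lines is what I was picturing, just a sketch based on my resizing code above; I have no idea whether this is actually a sane approach:

import torch
import torch.nn as nn
from transformers import BertForSequenceClassification

class ResizedBertForSequenceClassification(BertForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)
        # Rebuild the word embeddings with the extra rows before any checkpoint
        # weights get loaded, mirroring the resize I did at training time.
        weights = self.bert.embeddings.word_embeddings.weight.data
        new_weights = torch.cat((weights, weights[101:3399]), 0)
        self.bert.embeddings.word_embeddings = nn.Embedding.from_pretrained(
            new_weights, padding_idx=0, freeze=False)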
Here is the workaround:
import torch
import torch.nn as nn
from transformers import BertForSequenceClassification, BertTokenizer

# Build the base model, ignoring the embedding shape mismatch for now.
model = BertForSequenceClassification.from_pretrained(model_dir, num_labels=2, ignore_mismatched_sizes=True)
# Manually redo the resize so the shapes match the checkpoint...
weights = model.bert.embeddings.word_embeddings.weight.data
new_weights = torch.cat((weights, weights[101:3399]), 0)
new_emb = nn.Embedding.from_pretrained(new_weights, padding_idx=0, freeze=False)
model.bert.embeddings.word_embeddings = new_emb
# ...then load the saved weights on top.
model.load_state_dict(torch.load(state_dict_dir, weights_only=True))
tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=True)
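After this the embedding shape lines up with the checkpoint again:

# Quick sanity check on the resized embedding matrix.
print(model.bert.embeddings.word_embeddings.weight.shape)  # torch.Size([33820, 768])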
What is the right way to accomplish what I’m trying to do? If anything is unclear, please say so and I’ll do my best to clear it up.