I’m a machine learning newbie, so I’m sorry if this isn’t entirely clear, but I’ll try to be as concise as I can.
I’m fine-tuning a pretrained BERT model, specifically 'bert-base-uncased'. Part of this involved resizing the model’s embedding matrix so I could add new tokens, namely emojis, and have the tokenizer tokenize them properly. I did it like so:
import torch
import torch.nn as nn
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                       num_labels=2,
                                                       output_attentions=False,
                                                       output_hidden_states=False)

# Extend the embedding matrix, reusing a slice of the existing rows
# as initial weights for the new (emoji) tokens.
weights = model.bert.embeddings.word_embeddings.weight.data
new_weights = torch.cat((weights, weights[101:3399]), 0)
new_emb = nn.Embedding.from_pretrained(new_weights, padding_idx=0, freeze=False)
model.bert.embeddings.word_embeddings = new_emb
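For completeness, the tokenizer side was handled roughly like this (the emoji list below is just a placeholder, not my actual list):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# Placeholder list; the real one has 3298 emojis to match the 3298 new embedding rows above.
new_tokens = ['😀', '😂', '🔥']
tokenizer.add_tokens(new_tokens)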
This worked: emojis were tokenized properly, and I saved the model so I could load it later for evaluation or further fine-tuning. However, when I try to load the model in a separate evaluation script, I get a tensor size mismatch error. Specifically:
Error(s) in loading state_dict for BertForSequenceClassification:
size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([33820, 768]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method
Now, I have a workaround, but it feels janky and frankly I don’t like it, because ideally I’d like something scalable. There’s a good chance I’ll have to add more new tokens later, and I dislike the idea of having to manually resize the embeddings every time I load the model. I’m unsure what the right way is to, essentially, save the model and its current weights. I’ve thought about creating a new model class that inherits from BertForSequenceClassification, just with resized embeddings, but I’m unsure how to accomplish that.
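Something along these lines is what I was picturing, just a sketch based on my resizing code above; I have no idea whether this is actually a sane approach:

import torch
import torch.nn as nn
from transformers import BertForSequenceClassification

class ResizedBertForSequenceClassification(BertForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)
        # Rebuild the word embeddings with the extra rows before any checkpoint
        # weights get loaded, mirroring the resize I did at training time.
        weights = self.bert.embeddings.word_embeddings.weight.data
        new_weights = torch.cat((weights, weights[101:3399]), 0)
        self.bert.embeddings.word_embeddings = nn.Embedding.from_pretrained(
            new_weights, padding_idx=0, freeze=False)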
Here is the workaround:
import torch
import torch.nn as nn
from transformers import BertForSequenceClassification, BertTokenizer

# Build the base model, ignoring the embedding shape mismatch for now.
model = BertForSequenceClassification.from_pretrained(model_dir, num_labels=2, ignore_mismatched_sizes=True)
# Manually redo the resize so the shapes match the checkpoint...
weights = model.bert.embeddings.word_embeddings.weight.data
new_weights = torch.cat((weights, weights[101:3399]), 0)
new_emb = nn.Embedding.from_pretrained(new_weights, padding_idx=0, freeze=False)
model.bert.embeddings.word_embeddings = new_emb
# ...then load the saved weights on top.
model.load_state_dict(torch.load(state_dict_dir, weights_only=True))
tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=True)
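After this the embedding shape lines up with the checkpoint again:

# Quick sanity check on the resized embedding matrix.
print(model.bert.embeddings.word_embeddings.weight.shape)  # torch.Size([33820, 768])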
What is the right way to accomplish what I’m trying to do? If anything is unclear, please say so and I’ll do my best to clear it up.