Fine tune vocab size of pre-trained Causal Language Model

Hello everyone,
I’d like to know if there’s a way to adjust the vocab size of a pre-trained Causal Language Model so that, for example, instead of being able to predict one of 50k words, it will be able to predict only n words (with n being predefined). Is this possible?

Thanks for your time,


In general you would have to replace the embedding matrix, which stores the embeddings of all the vocabulary items. The easiest way is to define a new matrix (Embedding — PyTorch 1.12 documentation) of the correct size (n x hidden) and initialize the vectors randomly. The problem is that this more or less ruins the entire model.
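For a Hugging Face causal LM, that could look roughly like the sketch below. This is a minimal sketch, not a definitive recipe: it assumes GPT-2 loaded via transformers, `n` is a placeholder for your new vocabulary size, and the output head has to shrink together with the input embeddings (for other architectures the head attribute may not be called `lm_head`).

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

n = 1000  # placeholder: the new, smaller vocabulary size

model = AutoModelForCausalLM.from_pretrained("gpt2")
hidden = model.get_input_embeddings().embedding_dim

# Replace the input embedding matrix with a randomly initialized (n x hidden) one.
model.set_input_embeddings(nn.Embedding(n, hidden))

# The output head predicts over the vocabulary, so it must be replaced as well.
# GPT-2 normally ties lm_head to the input embeddings; model.tie_weights()
# would re-tie them to the new matrix if you want that behaviour.
model.lm_head = nn.Linear(hidden, n, bias=False)

# Keep the config consistent so saving/generation uses the new size.
model.config.vocab_size = n
```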

If there is a large overlap between the old and new vocabularies, you can initialize the new embeddings with the corresponding old ones.
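A hedged sketch of that copy step, assuming `old_vocab` and `new_vocab` are token-to-id mappings (for example from a tokenizer's `get_vocab()`) and that both embedding matrices use the same hidden size:

```python
import torch
import torch.nn as nn

def init_from_overlap(old_embedding: nn.Embedding,
                      new_embedding: nn.Embedding,
                      old_vocab: dict,   # {token: old_id}
                      new_vocab: dict):  # {token: new_id}
    """Copy the old vector for every token that exists in both vocabularies;
    tokens only present in the new vocabulary keep their random initialization."""
    with torch.no_grad():
        for token, new_id in new_vocab.items():
            old_id = old_vocab.get(token)
            if old_id is not None:
                new_embedding.weight[new_id] = old_embedding.weight[old_id]
```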

I used this example once:

Thanks, thies!