Fine tune vocab size of pre-trained Causal Language Model

franfram · October 9, 2022, 10:44am

Hello everyone,
I’d like to know if there’s a way to adjust the vocab size of a pre-trained Causal Language Model so that, for example, instead of being able to predict one of 50k words, it will be able to predict only n words (with n being predefined). Is this possible?

Thanks for your time,

thies · October 11, 2022, 12:53pm

In general, you would have to replace the embedding matrix, which stores the embeddings of all vocabulary items. The easiest way is to define a new matrix (see Embedding in the PyTorch 1.12 documentation) of the correct size (n x hidden) and initialize the vectors randomly. The problem is that this more or less ruins the entire model, since the remaining layers were trained against the old embeddings.
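
As a rough illustration of the random-initialization route, here is a minimal sketch in plain PyTorch. The sizes (`hidden`, `old_vocab_size`, `n`) and the variable names are made up for the example, and the stand-in layers are not taken from any specific pre-trained model. It also assumes the output projection is tied to the input embeddings, which is common in causal language models but should be checked for the model at hand.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
hidden = 16           # hidden size of the model
old_vocab_size = 100  # stands in for the original ~50k vocabulary
n = 10                # new, predefined vocabulary size

# Stand-in for the pre-trained embedding matrix (old_vocab_size x hidden).
old_embedding = nn.Embedding(old_vocab_size, hidden)

# New embedding matrix of size (n x hidden), randomly initialized.
new_embedding = nn.Embedding(n, hidden)

# To predict only n tokens, the output projection also needs n outputs.
# Assumption: the weights are tied to the input embeddings.
lm_head = nn.Linear(hidden, n, bias=False)
lm_head.weight = new_embedding.weight

token_ids = torch.tensor([[0, 3, 9]])  # ids must now be in [0, n)
vectors = new_embedding(token_ids)     # shape: (1, 3, hidden)
logits = lm_head(vectors)              # shape: (1, 3, n)
print(vectors.shape, logits.shape)
```

After a swap like this the new vectors carry no learned information, so the model would need fine-tuning before its predictions are useful again.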

If there is a large overlap between the old and new vocabularies, you can initialize the new embeddings with the corresponding old ones.
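
A minimal sketch of that idea, assuming both vocabularies are available as token-to-id dictionaries. The toy vocabularies and names below are hypothetical. Tokens that appear in both vocabularies get their old vector copied over, and the rest keep their random initialization.

```python
import torch
import torch.nn as nn

hidden = 8  # hypothetical hidden size

# Toy token -> id mappings, made up for illustration.
old_vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
new_vocab = {"cat": 0, "mat": 1, "dog": 2}

old_embedding = nn.Embedding(len(old_vocab), hidden)  # stand-in for pre-trained weights
new_embedding = nn.Embedding(len(new_vocab), hidden)  # random init by default

# Copy the old vector for every token present in both vocabularies.
with torch.no_grad():
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_embedding.weight[new_id] = old_embedding.weight[old_id]
        # Tokens missing from the old vocabulary ("dog" here) stay random.
```

The more tokens the two vocabularies share, the less of the model has to be relearned during fine-tuning.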

I used this example once:

franfram · October 17, 2022, 1:20pm

Thanks, thies!
