Create entirely new vocabulary for tokenizer

I am building an encoder-decoder model based on facebook/bart-base for the purpose of solving math problems. I would like to train the model so that the decoder can only output a small set of words (e.g. “multiply”, “divide”, “add”, “subtract”, etc.) and numbers. Is it possible to completely replace the vocabulary used by the decoder tokenizer, rather than just adding new tokens to it?
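To make the goal concrete, here is a minimal plain-Python sketch of the behavior I want from the decoder tokenizer (all token names and ids below are made up for illustration; the real vocabulary would also include number tokens and the special tokens BART expects):

```python
# Illustrative restricted vocabulary -- ids are arbitrary examples.
VOCAB = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3,
         "add": 4, "subtract": 5, "multiply": 6, "divide": 7}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text: str) -> list[int]:
    """Map whitespace-separated words to ids; unknown words become <unk>."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.split()]

def decode(ids: list[int]) -> str:
    """Map ids back to their tokens."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)

print(encode("multiply add"))  # [6, 4]
print(decode([7, 5]))          # divide subtract
```

Essentially, I want the decoder side of the model to operate over a closed vocabulary like this, with everything outside it mapped to an unknown token, instead of the full 50k-token vocabulary that ships with the pretrained checkpoint.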