How is the vocabulary of the BERT tokenizer generated?

I want to understand how the vocabulary of the BERT tokenizer is generated. Is it created during training, is it built manually by selecting potential words, or something else? My understanding is that during training the model keeps updating its token embeddings, and there is also a process for adding tokens.
So, in conclusion, please answer the two questions below; any further explanation would be appreciated.

  1. What is the limit on extending the tokens/vocabulary?
  2. What is the best way to initialize the embeddings of new tokens?

Hi,

One typically takes a representative portion of the corpus (the text training data) and runs a so-called tokenization algorithm on it. Popular tokenization algorithms include WordPiece, SentencePiece and BPE (byte-pair encoding). The tokenization algorithm outputs a vocabulary, which is just a list of tokens (typical vocabularies contain about 30k tokens). The vocabulary is created based on the frequency of tokens in the text. This is why Hugging Face built the Tokenizers library: to “train” these tokenization algorithms on your text. By training, we simply mean creating this vocabulary based on frequency.
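
For instance, “training” a WordPiece vocabulary with the Tokenizers library boils down to a few lines. This is only a minimal sketch: the corpus file name, vocabulary size and output directory are placeholders, not values prescribed by BERT.

```python
import os
from tokenizers import BertWordPieceTokenizer

# Start from an empty BERT-style WordPiece tokenizer.
tokenizer = BertWordPieceTokenizer(lowercase=True)

# Build the vocabulary from token frequencies in the raw text files.
tokenizer.train(
    files=["my_corpus.txt"],   # a representative portion of your corpus
    vocab_size=30_000,         # a typical BERT-sized vocabulary
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Save vocab.txt (one token per line) so it can be reused later.
os.makedirs("my_tokenizer", exist_ok=True)
tokenizer.save_model("my_tokenizer")
```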

Once we have the vocabulary, we can use it to tokenize texts (by simply matching the text against the tokens in the vocabulary), and we can start training a PyTorch/JAX model.
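
As a quick sketch of that lookup step, reusing the vocabulary file produced above (the path is just an example):

```python
from tokenizers import BertWordPieceTokenizer

# Load the vocabulary created earlier and tokenize some text with it.
tokenizer = BertWordPieceTokenizer("my_tokenizer/vocab.txt", lowercase=True)

encoding = tokenizer.encode("Tokenization splits words into sub-word units.")
print(encoding.tokens)  # e.g. ['[CLS]', 'token', '##ization', ...] depending on the vocab
print(encoding.ids)     # the integer ids a PyTorch/JAX model's embedding layer looks up
```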

So we have two ways of obtaining a vocabulary:

  1. take the vocabulary already built during BERT tokenizer pretraining and fine-tune (the easy way; see the sketch below);
  2. generate another vocabulary with any of the three methods above (WordPiece, BPE, SentencePiece, etc.).
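
For reference, option 1 is just a matter of loading the vocabulary that ships with a pretrained checkpoint (a minimal sketch; the checkpoint name and the example word are only illustrations):

```python
from transformers import AutoTokenizer

# Reuse the vocabulary that was built when BERT was pretrained.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(len(tokenizer))                     # roughly 30k tokens in the pretrained vocab
print(tokenizer.tokenize("myocarditis"))  # unseen domain terms get split into sub-word pieces
```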

Now the question arises: if I want to fine-tune a model on my own dataset, what should I do? If I generate my own vocabulary for the tokenizer but keep the pretrained weights (embeddings), the distribution of the pretrained model will be disturbed, because I will have replaced the tokens for which the model already has embeddings.
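
For context, the token-adding process mentioned earlier keeps the pretrained vocabulary intact and only appends new entries, so the existing embeddings stay aligned with the tokenizer; only the newly added rows need to be initialized. A minimal sketch, assuming a bert-base-uncased checkpoint; the medical terms and the mean-of-embeddings initialization are illustrative assumptions, not the only option:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Append domain-specific tokens to the existing vocabulary (placeholders).
num_added = tokenizer.add_tokens(["myocarditis", "angioplasty"])

# Grow the embedding matrix; the new rows are randomly initialized by default.
model.resize_token_embeddings(len(tokenizer))

# One common heuristic: start the new rows at the mean of the pretrained
# embeddings so the new tokens begin "in distribution".
if num_added > 0:
    with torch.no_grad():
        emb = model.get_input_embeddings().weight
        emb[-num_added:] = emb[:-num_added].mean(dim=0)
```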

Please help: what should I do?

Is there a better way? I need to fine-tune for better results on medical datasets; please also suggest models if you know of any.
On the same topic, any further articles/webpages/blog posts would be really helpful. I just need something that helps me learn and get going in the right direction.

Thank you so much for your valuable response.