How is the vocabulary of the BERT tokenizer generated?

I want to understand how the BERT tokenizer's vocabulary is generated. Is it created by training, is it built manually by selecting candidate words, or something else? As I understand it, the model keeps updating its token embeddings during training, and there is also a process for adding tokens.
In short, please answer the two questions below; any further explanation would be appreciated.
What is the limit on extending the tokens/vocab? What is the best way to initialize the embeddings of the new tokens?


One typically takes a representative portion of the corpus (the text training data) and runs a so-called tokenization algorithm on it. Popular tokenization algorithms include WordPiece (the one BERT uses), BPE (byte-pair encoding) and Unigram; SentencePiece is a widely used library that implements BPE and Unigram. The algorithm outputs a vocabulary, which is just a list of tokens (typical vocabularies contain about 30k tokens). The vocabulary is built from how frequently character sequences occur in the text, so it is learned from data rather than picked by hand. This is why Hugging Face built the Tokenizers library: to “train” these tokenization algorithms on your text. By training, we simply mean creating this vocabulary based on frequency; no neural network weights are involved at this stage.
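To make "creating the vocabulary based on frequency" concrete, here is a minimal sketch of BPE-style training in plain Python. It is a toy under stated assumptions, not what BERT actually ran: BERT uses WordPiece, which scores candidate merges differently, and the function name `train_toy_bpe` and the tiny corpus are made up for illustration.

```python
from collections import Counter


def train_toy_bpe(corpus, num_merges):
    """Learn a subword vocabulary from a corpus (toy BPE, illustration only)."""
    # Count each word, stored as a tuple of single-character symbols.
    words = Counter(tuple(w) for line in corpus for w in line.lower().split())
    # The starting vocabulary is every character seen in the corpus.
    vocab = sorted({ch for word in words for ch in word})

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge

        # Pick the most frequent pair (ties broken by the pair itself).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = best[0] + best[1]
        vocab.append(merged)

        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab


# Hypothetical toy corpus; a real one would be a large sample of your text.
corpus = ["low low low low low", "lower lower", "newest newest newest", "widest widest"]
vocab = train_toy_bpe(corpus, num_merges=5)
print(vocab)
```

Frequent character sequences such as "low" end up as single tokens, while rare words stay split into smaller pieces. In practice you would use a library such as Hugging Face Tokenizers rather than a hand-written loop.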

Once we have the vocabulary, we can use it to tokenize texts (by matching pieces of the text against the entries in the vocabulary) and we can start training a PyTorch/JAX model.
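The lookup step can be sketched as a greedy longest-match-first search, which is the general idea behind WordPiece tokenization. This is a simplified sketch: the vocabulary `toy_vocab` is invented for illustration, and a real tokenizer also handles lowercasing, punctuation splitting and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary; a real BERT vocabulary has about 30k entries.
toy_vocab = {"[UNK]", "token", "##ization", "##ize", "##r"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("tokenizer", toy_vocab))     # ['token', '##ize', '##r']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Each resulting token is then mapped to its index in the vocabulary, and that index selects a row of the model's embedding matrix.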

So we have two ways of obtaining a vocab:

  1. take the vocab already built when the BERT tokenizer was pretrained, and fine-tune (the easy way);
  2. generate a new vocab with one of the methods above (WordPiece, BPE, SentencePiece, etc.).

Now the question arises: if I want to fine-tune a model on my own dataset, what should I do? Should I generate my own vocab for the tokenizer and still take the pretrained weights (embeddings)? If I do, the pretrained model will be disturbed, because I will have replaced the tokens for which the model already has embeddings, so the token IDs no longer line up with the embeddings learned for them.

Please help with this: what should I do?

Is there a better way? I need to fine-tune for better results on medical datasets; please suggest models if you know of any.
Any related articles, webpages or blogs would also be really helpful. I just need something that can help me learn and get going in the right direction.

Thank you so much for your valuable response.