TL;DR: The vocabulary size changes the number of parameters of a model. If we want to compare models with different vocabulary sizes, what would be the fairest strategy: fixing the total number of parameters, or keeping the same architecture with the same number of layers, attention heads, etc.?
We have a set of mini models pretrained from scratch using the RoBERTa architecture. The number of layers, hidden size, and number of attention heads match those of the mini models in the BERT paper. We wanted to experiment with the effect of different tokenization algorithms on downstream performance, so we trained BPE, WordPiece, and WordLevel tokenizers with vocabulary sizes of 50K, 50K, and 100K respectively, in addition to character-based tokenization. The vocabulary size is larger for the WordLevel tokenizer in order to reduce the number of OOV tokens.
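For context, here is a minimal sketch of how such tokenizers can be trained with the Hugging Face `tokenizers` library; the corpus path, special tokens, and whitespace pre-tokenizer are placeholders, not our exact setup:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece, WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, WordPieceTrainer, WordLevelTrainer

specials = ["<s>", "</s>", "<pad>", "<unk>", "<mask>"]  # placeholder special tokens
files = ["corpus.txt"]                                   # placeholder pretraining corpus

# BPE and WordPiece with 50K vocabularies
bpe = Tokenizer(BPE(unk_token="<unk>"))
bpe.pre_tokenizer = Whitespace()
bpe.train(files, BpeTrainer(vocab_size=50_000, special_tokens=specials))

wp = Tokenizer(WordPiece(unk_token="<unk>"))
wp.pre_tokenizer = Whitespace()
wp.train(files, WordPieceTrainer(vocab_size=50_000, special_tokens=specials))

# WordLevel with a 100K vocabulary to keep OOV tokens rare
wl = Tokenizer(WordLevel(unk_token="<unk>"))
wl.pre_tokenizer = Whitespace()
wl.train(files, WordLevelTrainer(vocab_size=100_000, special_tokens=specials))
```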
Only later did we notice that the difference in vocabulary sizes causes a huge difference in the number of parameters. The model sizes are 20.4M, 20.4M, 33.2M, and 8.1M for the BPE, WordPiece, WordLevel, and char tokenizer-based models respectively. This means that the percentage of parameters coming from the vocabulary embeddings is 63%, 63%, 77%, and 1% for the BPE, WordPiece, WordLevel, and char tokenizer-based models respectively.
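A quick back-of-the-envelope check reproduces roughly these percentages, assuming the BERT-mini hidden size of 256 and a character vocabulary of a few hundred entries (exact totals also depend on position embeddings, the LM head, and weight tying):

```python
# Fraction of each model that is vocabulary (token embedding) parameters.
hidden_size = 256  # assumption: BERT-mini hidden size

models = {
    # name: (vocab_size, total_params_reported)
    "BPE":       (50_000,  20.4e6),
    "WordPiece": (50_000,  20.4e6),
    "WordLevel": (100_000, 33.2e6),
    "char":      (300,      8.1e6),  # assumption: a few hundred characters
}

for name, (vocab_size, total) in models.items():
    embed_params = vocab_size * hidden_size
    print(f"{name:10s} embedding params: {embed_params / 1e6:5.1f}M "
          f"({embed_params / total:.0%} of total)")
```

With these assumptions the embedding matrices come out to about 12.8M (63%), 12.8M (63%), 25.6M (77%), and 0.1M (~1%) parameters, matching the percentages above.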
My question is: is it unfair to compare the downstream performance of these models on the same task with the same dataset just because their numbers of parameters differ? I would assume that in a given training step, only a very small part of the vocabulary embeddings is updated: a parameter in the transformer blocks of the model is updated at every step, whereas an embedding row in the vocabulary is only updated when its token appears in the input text. Therefore it is not accurate to say that, for example, all 100K vocabulary entries of the WordLevel tokenizer-based model contribute to the computation for a given input. This suggests that as long as the numbers of parameters in the transformer blocks of the models are comparable, it is fair to compare the performance of the models. If this assumption is incorrect, I would be happy to be corrected.
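To illustrate that assumption, here is a minimal PyTorch sketch (a toy embedding layer, not our actual model) showing that only the embedding rows whose tokens occur in a batch receive a gradient:

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 100_000, 256
embedding = nn.Embedding(vocab_size, hidden_size)

batch = torch.randint(0, vocab_size, (8, 128))  # 8 sequences of 128 token ids
loss = embedding(batch).sum()                   # stand-in for the real loss
loss.backward()

# Count embedding rows with a nonzero gradient after one backward pass.
rows_with_grad = (embedding.weight.grad.abs().sum(dim=1) > 0).sum().item()
print(f"{rows_with_grad} of {vocab_size} embedding rows received a gradient "
      f"(at most 8 * 128 = {8 * 128})")
```

So per step, at most as many embedding rows as there are distinct tokens in the batch get a gradient, while every transformer-block parameter does.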
Thanks for your time.