Custom tokenizer: finetune model or retrain model?

Hello there!

I’m working on a side project where I’ve been creating a custom dataset (HF Dataset) and a custom tokenizer (HF Tokenizer).

The dataset contains raw text strings representing all the moves played in a chess game (each row is a full game). Using this dataset, I trained a custom tokenizer based on the Google BERT one (`AutoTokenizer.from_pretrained('google-bert/bert-base-cased')`).
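For reference, the tokenizer step looks roughly like this (the column name, sample rows, and vocab size below are just placeholders for my actual setup):

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Illustrative stand-in for my real dataset: one game per row as a raw string.
chess_dataset = Dataset.from_dict(
    {"moves": ["1. e4 e5 2. Nf3 Nc6 3. Bb5 a6", "1. d4 d5 2. c4 e6 3. Nc3 Nf6"]}
)

base_tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def game_iterator(batch_size=1000):
    # Yield batches of raw game strings for tokenizer training.
    for i in range(0, len(chess_dataset), batch_size):
        yield chess_dataset[i : i + batch_size]["moves"]

# Retrain the WordPiece vocabulary on the chess games; vocab_size is a placeholder.
chess_tokenizer = base_tokenizer.train_new_from_iterator(game_iterator(), vocab_size=3000)
chess_tokenizer.save_pretrained("chess-tokenizer")
```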

At this point, my plan was to train the model (again Google BERT from pretrained, so the tokenizer matches its corresponding model), but I have a question…

Should I train the model from scratch, or finetune it? On one hand, I think finetuning should be the right way in most cases (mine being one of them). On the other hand, the vocabulary in my custom tokenizer after training is far smaller than the original (roughly 1:10) and, as you can imagine if you are familiar with chess notation, pretty different from English or any other human language.

Any suggestion on what the best option would be? I have chosen google-bert/bert-base-cased because I want to use a fill-mask (masked language modeling) strategy for training the model.
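For context, this is roughly how I picture the two options; the model sizes, tokenizer path, and config values are placeholders, not settled choices:

```python
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
)

chess_tokenizer = AutoTokenizer.from_pretrained("chess-tokenizer")

# Option A: train from scratch, with a config sized to the small chess vocabulary.
config = BertConfig(
    vocab_size=chess_tokenizer.vocab_size,
    hidden_size=256,          # a smaller model may be enough for such a small vocab
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
)
scratch_model = BertForMaskedLM(config)

# Option B: finetune the pretrained checkpoint. Its embeddings are tied to the
# original ~29k-token English vocabulary, so the embedding matrix would have to
# be resized (and effectively re-learned) to match the new tokenizer.
finetune_model = BertForMaskedLM.from_pretrained("google-bert/bert-base-cased")
finetune_model.resize_token_embeddings(chess_tokenizer.vocab_size)

# Either way, the fill-mask objective uses the standard masked-LM collator.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=chess_tokenizer, mlm=True, mlm_probability=0.15
)
```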

Best regards,
Edo

You might find this discussion helpful: