Custom tokenizer: finetune model or retrain model?

Hello there!

I’m working on a side project where I’ve been creating a custom dataset (HF Dataset) and a custom tokenizer (HF Tokenizer).

The dataset contains raw text strings representing all the moves played in a chess game (each row is a full game). Using this dataset, I trained a custom tokenizer based on the Google BERT one (`AutoTokenizer.from_pretrained('google-bert/bert-base-cased')`).
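For reference, the tokenizer step looks roughly like this (the column name, sample rows, and vocab size below are just placeholders for my actual setup):

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Illustrative stand-in for my real dataset: one game per row as a raw string.
chess_dataset = Dataset.from_dict(
    {"moves": ["1. e4 e5 2. Nf3 Nc6 3. Bb5 a6", "1. d4 d5 2. c4 e6 3. Nc3 Nf6"]}
)

base_tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def game_iterator(batch_size=1000):
    # Yield batches of raw game strings for tokenizer training.
    for i in range(0, len(chess_dataset), batch_size):
        yield chess_dataset[i : i + batch_size]["moves"]

# Retrain the WordPiece vocabulary on the chess games; vocab_size is a placeholder.
chess_tokenizer = base_tokenizer.train_new_from_iterator(game_iterator(), vocab_size=3000)
chess_tokenizer.save_pretrained("chess-tokenizer")
```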

At this point, my plan was to train the model (again Google BERT from pretrained, so the tokenizer matches its corresponding model), but I have a question…

Should I train the model from scratch, or finetune it? On one hand, I think finetuning should be the right way in most cases (mine being one of them). On the other hand, the vocabulary in my custom tokenizer after training is far smaller than the original (roughly 1:10) and, as you can imagine if you are familiar with chess notation, pretty different from English or any other human language.

Any suggestion on what the best option would be? I have chosen google-bert/bert-base-cased because I want to use a fill-mask (masked language modeling) strategy for training the model.
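For context, this is roughly how I picture the two options; the model sizes, tokenizer path, and config values are placeholders, not settled choices:

```python
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
)

chess_tokenizer = AutoTokenizer.from_pretrained("chess-tokenizer")

# Option A: train from scratch, with a config sized to the small chess vocabulary.
config = BertConfig(
    vocab_size=chess_tokenizer.vocab_size,
    hidden_size=256,          # a smaller model may be enough for such a small vocab
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
)
scratch_model = BertForMaskedLM(config)

# Option B: finetune the pretrained checkpoint. Its embeddings are tied to the
# original ~29k-token English vocabulary, so the embedding matrix would have to
# be resized (and effectively re-learned) to match the new tokenizer.
finetune_model = BertForMaskedLM.from_pretrained("google-bert/bert-base-cased")
finetune_model.resize_token_embeddings(chess_tokenizer.vocab_size)

# Either way, the fill-mask objective uses the standard masked-LM collator.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=chess_tokenizer, mlm=True, mlm_probability=0.15
)
```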

Best regards,
Edo

You might find this discussion helpful: