Hey guys,
I am trying to fine-tune a MarianMT model that translates from Japanese to English, and I have around 50 new tokens to add. So I took the following steps:
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
Then I fine-tuned the model on the kde4 dataset. Since the MarianMT tokenizer doesn’t have a train_new_from_iterator method, I can’t retrain the tokenizer.
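For context, the whole setup looks roughly like this (the checkpoint name and the token list are placeholders, and I’m assuming the en-ja pair is available in the kde4 config):

from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer

checkpoint = "Helsinki-NLP/opus-mt-ja-en"  # standard OPUS ja->en checkpoint, used here as an example
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

new_tokens = ["トークンA", "トークンB"]  # placeholder for the ~50 domain-specific tokens
num_added = tokenizer.add_tokens(new_tokens)   # register the new tokens with the tokenizer
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to the new vocab size

# kde4 parallel corpus for fine-tuning (assuming the en-ja pair exists in this config)
raw_datasets = load_dataset("kde4", lang1="en", lang2="ja")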
My question is: "Do I need to retrain the tokenizer, or is just fine-tuning the model enough?"
Thank you for your time.
@kumarme072 I think so. I fine-tuned the model and it showed some improvement: the BLEU score went from 39.1 originally to 54. Now my problem is not having enough corpus that includes the added tokens. Anyway, thank you for your answer.
Yes, I fine-tuned the model with a custom dataset (a combination of 100 sentences and the kde4 dataset). I think we should use a corpus that contains the vocabulary we added, with at least 30 or so sentences per added word. In my experiments, I added 50 words and fine-tuned on only around 100 sentences; it was a waste of time and the results were poor. So I combined them with the kde4 dataset from Hugging Face, which improved the model from 39.1 to 54 SacreBLEU. What do you think? Do you have any other ideas?
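This is roughly what I mean by combining the two and scoring with SacreBLEU (the example sentence pair and the predictions/references are placeholders, and I’m assuming the kde4 en-ja config plus the evaluate library):

from datasets import Dataset, concatenate_datasets, load_dataset
import evaluate

# kde4 en-ja training split, keeping only the translation column so the schemas line up
kde4 = load_dataset("kde4", lang1="en", lang2="ja")["train"]
kde4 = kde4.remove_columns([c for c in kde4.column_names if c != "translation"])

# the ~100 hand-made sentence pairs that actually contain the added tokens (placeholder content)
custom = Dataset.from_dict({
    "translation": [
        {"en": "Example sentence with a new term.", "ja": "新しい用語を含む例文。"},
    ]
}).cast(kde4.features)  # align features with kde4 before concatenating

combined = concatenate_datasets([kde4, custom])

# SacreBLEU on the model's decoded outputs vs. reference translations
sacrebleu = evaluate.load("sacrebleu")
result = sacrebleu.compute(
    predictions=["This is a test."],    # decoded model outputs
    references=[["This is a test."]],   # one or more references per prediction
)
print(result["score"])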
Of course, the pre-training corpus and the tokenizer’s ability to tokenize the words (fertility) matter a lot. I also faced similar problems finding a proper model for task-specific fine-tuning, and found that the presence of related context in the pre-training corpus is essential for good fine-tuning: the model generalizes faster and improves more.
For example, fine-tuning a model whose pre-training data contains plenty of that language (other than English) turned out to work better than fine-tuning a popular model whose pre-training data rarely included that language.
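A quick way to sanity-check the fertility point on the Japanese source side is to count subword pieces per character (Japanese has no whitespace word boundaries, so this is only a rough proxy; the checkpoint name and example sentence are assumptions):

from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ja-en")  # assumed checkpoint

def pieces_per_char(sentences):
    # rough fertility proxy for Japanese: subword pieces per input character;
    # lower means the tokenizer keeps domain terms more intact
    n_chars = sum(len(s) for s in sentences)
    n_pieces = sum(len(tokenizer.tokenize(s)) for s in sentences)
    return n_pieces / max(n_chars, 1)

print(pieces_per_char(["ドメイン固有の用語が多くの断片に分割されないか確認する。"]))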