Adding New Tokens to MarianMT Model

Hey guys,
I am trying to fine-tune a MarianMT model that translates from Japanese to English. I have some new tokens to add (around 50), so I took the following steps:

  1. tokenizer.add_tokens(new_tokens)
  2. model.resize_token_embeddings(len(tokenizer))
Then I fine-tuned the model on the kde4 dataset. Since the MarianMT tokenizer doesn't have a train_new_from_iterator method, I can't fine-tune the tokenizer.
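Concretely, this is roughly what I did (the checkpoint name and the token list below are just placeholders for whatever you are using):

```python
from transformers import MarianMTModel, MarianTokenizer

# Placeholder Japanese-to-English Marian checkpoint; swap in the one you fine-tune.
model_name = "Helsinki-NLP/opus-mt-ja-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Hypothetical domain-specific tokens (~50 in my case).
new_tokens = ["<term_a>", "<term_b>"]

# 1. Add the tokens; only those not already in the vocab are actually added.
num_added = tokenizer.add_tokens(new_tokens)

# 2. Grow the embedding matrix so the new token IDs get (randomly initialized) rows.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")
```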

My question is: do I need to re-train the tokenizer, or is just fine-tuning the model enough?
Thank you for your time.


Fine-tuning is enough, go ahead.

Also, I would like to connect with you. I am working on the same kind of thing in a different domain.

Did you train the model? I am also working on extending the vocabulary of seq2seq models and LLMs, and figuring out how to train them effectively.

@kumarme072 I think so. I fine-tuned the model and it showed some improvement: the BLEU score went from 39.1 originally to 54. Now my problem is not having a large enough corpus that includes the added tokens. Anyway, thank you for your answer.

Yes, I fine-tuned the model with a custom dataset (a combination of 100 sentences and the kde4 dataset). I think we should use a corpus that contains the vocabulary we added, with at least 30 sentences per added token. In my experiments, I added 50 vocabulary items and fine-tuned on only around 100 sentences; that was a waste of time and gave poor results. So I combined them with the kde4 dataset from Hugging Face, which improved the model from 39.1 to 54 SacreBLEU. What do you think? Do you have any other ideas?
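To see whether the training data actually covers the added tokens, I use a rough check like the sketch below (the corpus, token list, and the 30-sentence threshold are just placeholders for my own setup):

```python
from collections import Counter

# Rough coverage check: how many training sentences contain each added token?
def added_token_coverage(corpus, new_tokens, min_count=30):
    counts = Counter({tok: 0 for tok in new_tokens})
    for sentence in corpus:
        for tok in new_tokens:
            if tok in sentence:
                counts[tok] += 1
    rare = [tok for tok in new_tokens if counts[tok] < min_count]
    return counts, rare

# Placeholder data standing in for the combined custom + kde4 sentences.
corpus = [
    "Open the <term_a> panel from the settings menu.",
    "The <term_b> option is disabled by default.",
]
new_tokens = ["<term_a>", "<term_b>"]
counts, rare = added_token_coverage(corpus, new_tokens)
print(f"{len(rare)} of {len(new_tokens)} added tokens appear in fewer than 30 sentences")
```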

Of course, the pretraining corpus and the tokenizer's ability to tokenize the words (its fertility) matter a lot. I also faced similar problems finding a proper model for task-specific fine-tuning, and found that the presence of related content in the pre-training corpus is essential for good fine-tuning: the model generalized faster and improved more.
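A quick, unofficial way to eyeball fertility is the ratio of subword pieces to whitespace-separated words, something like the sketch below. Note two caveats: the checkpoint and sentences are placeholders, and for a Marian tokenizer `tokenize()` applies the source-side SentencePiece model, so the word-count denominator only makes sense for space-delimited text (e.g. the English side).

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; with a Marian tokenizer, tokenize() uses the source-side
# SentencePiece model, so adapt this to whichever side of the pair you care about.
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ja-en")

def fertility(tokenizer, sentences):
    # Average number of subword pieces per whitespace-separated word.
    n_words = sum(len(s.split()) for s in sentences)
    n_pieces = sum(len(tokenizer.tokenize(s)) for s in sentences)
    return n_pieces / max(n_words, 1)

# Placeholder domain sentences; higher fertility suggests the tokenizer has seen
# little related text during pretraining.
domain_sentences = ["The widget can be configured from the panel toolbox."]
print(f"fertility: {fertility(tokenizer, domain_sentences):.2f} pieces per word")
```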


For example, fine-tuning a model whose pretraining data contains plenty of text in the target language (other than English) was found to work better than fine-tuning a popular model that had rarely seen that language.

