LM finetuning on domain specific unlabelled data

Hello Team,
Thanks a lot for the great work!
Could you please tell me how to fine-tune an (any) MLM model on a domain-specific corpus? I am following this link obtained from the Hugging Face documentation. Is this the procedure I should be following? If so, how will it update the vocabulary to adapt to the new tokens in my domain-specific corpus?

Thanks in advance.

Can anyone help?

hey @sML, in general fine-tuning will not update the tokenizer's vocabulary: the vocabulary is fixed when the tokenizer is trained on the corpus that the model was pretrained on.

if your domain is not too different from the corpus used to pretrain the model, then I would just try fine-tuning the LM (following the tutorial you linked to) to see what kind of results you get.

one alternative would be to manually add new tokens to the tokenizer’s vocabulary, e.g.

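tokenizer = ...  # your pretrained tokenizer
model = ...  # the matching pretrained model

new_toks = [tok1, tok2, ..., tokN]  # domain-specific token strings
# docs: https://huggingface.co/transformers/internal/tokenization_utils.html?highlight=add_tokens#transformers.tokenization_utils_base.SpecialTokensMixin.add_tokens
# add_tokens returns the number of tokens that were actually added
num_added_toks = tokenizer.add_tokens(new_toks)
# you need to resize the dimension of the token embeddings to match the new vocab size
model.resize_token_embeddings(len(tokenizer))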

you can find a more sophisticated approach described in this paper: https://www.aclweb.org/anthology/2020.findings-emnlp.129.pdf


Thanks a lot for the clarification. Apparently my domain is completely different from the original pretraining corpus. Adding tokens manually also doesn’t seem feasible. I will go through the paper provided. Thank you once again.

If your domain vocabulary is totally different, then you probably do not want to adapt an existing pretrained LM to your domain-specific corpus. In this case, it probably makes more sense to train an LM from scratch.

You can check how different your domain is from an existing pretrained vocabulary by running the corresponding pretrained tokenizer against a sample of your domain’s text and counting the number of unknown tokens that show up.
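A minimal sketch of that check, written so it works with any tokenizer exposed as a plain function. The names `unknown_token_rate`, `toy_tokenize` and `TOY_VOCAB` are made up for illustration, and the toy tokenizer is only a stand-in so the example runs without downloading anything:

```python
from typing import Callable, Iterable, List


def unknown_token_rate(
    tokenize: Callable[[str], List[str]],
    texts: Iterable[str],
    unk_token: str,
) -> float:
    """Return the fraction of produced tokens that equal unk_token."""
    total = 0
    unknown = 0
    for text in texts:
        tokens = tokenize(text)
        total += len(tokens)
        unknown += sum(1 for tok in tokens if tok == unk_token)
    # An empty sample yields 0.0 rather than a division by zero.
    return unknown / total if total else 0.0


# Toy stand-in for a pretrained tokenizer (hypothetical): whitespace split
# with a fixed word vocabulary; every other word is mapped to "[UNK]".
TOY_VOCAB = {"the", "patient", "was", "given", "a", "dose", "of"}


def toy_tokenize(text: str) -> List[str]:
    return [w if w in TOY_VOCAB else "[UNK]" for w in text.lower().split()]


sample = ["The patient was given a dose of rituximab"]
rate = unknown_token_rate(toy_tokenize, sample, "[UNK]")
print(f"unknown-token rate: {rate:.3f}")  # prints "unknown-token rate: 0.125"
```

With a Hugging Face tokenizer you would presumably pass `tokenizer.tokenize` and `tokenizer.unk_token` in place of the toy pieces, and use a reasonably large sample of your own corpus.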

In my experience, the pretrained tokenizers like BPE will not return any unknown tokens if your domain-specific corpus is in the same language (e.g., English).
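The reason is that BPE falls back to smaller pieces instead of giving up on a word. Here is a toy sketch of BPE encoding to show the effect. It works on characters with a tiny hand-written merge table (`TOY_MERGES` is hypothetical), whereas real tokenizers such as the byte-level BPE used by GPT-2 and RoBERTa work on bytes with tens of thousands of learned merges:

```python
from typing import List, Tuple


def bpe_encode(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split word into characters, then apply merges in priority order."""
    # Earlier entries in the merge table have higher priority (lower rank).
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no learned merge applies any more
        # Merge every occurrence of the best pair, scanning left to right.
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, as if learned from general English text.
TOY_MERGES = [("t", "h"), ("th", "e"), ("i", "n"), ("a", "b")]

print(bpe_encode("the", TOY_MERGES))        # ['the']
print(bpe_encode("rituximab", TOY_MERGES))  # ['r', 'i', 't', 'u', 'x', 'i', 'm', 'ab']
```

The common word comes out as a single token, while the domain term is split into many small pieces but never becomes an unknown token. So a zero unknown count does not by itself mean the vocabulary fits your domain well.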

Thank you for the response!
Okay. My corpus is in English, but it is very specific to a particular domain. Would it be worth trying MLM fine-tuning with a model that uses a BPE tokenizer, maybe something like RoBERTa?

“Vocabulary” as it is used in the context of tokenizers is not the same as the common English definition of “vocabulary”, i.e., a tokenizer doesn’t necessarily break up a text into intelligible English words. You may want to read up on how tokenizers work: