LM fine-tuning on domain-specific unlabelled data

Hello Team,
Thanks a lot for the great work!
Can you please tell me how to fine-tune a (any) MLM model on a domain-specific corpus? I am following this link from the Hugging Face documentation. Is this the procedure I should be following? And if this is how it is done, how will it update the vocabulary to adapt to new tokens from my domain-specific corpus?

Thanks in advance.

Can anyone help?

hey @sML, in general fine-tuning will not update the vocabulary of the tokenizer as the vocabulary is specific to the corpus that the model was pretrained on.

if your domain is not too different from the corpus used to pretrain the model, then i would just try fine-tuning the LM (following the tutorial you linked to) to see what kind of results you get.
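
if it helps, here's a rough sketch of what that MLM fine-tuning could look like with the Trainer API (the checkpoint name, file path, and hyperparameters below are just placeholders, not taken from the tutorial):

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# load the domain corpus as plain text, one document per line (path is a placeholder)
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# the collator masks a fraction of tokens on the fly for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-mlm", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()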

one alternative would be to manually add new tokens to the tokenizer’s vocabulary, e.g.

from transformers import AutoTokenizer, AutoModelForMaskedLM

# any pretrained checkpoint works here; "bert-base-uncased" is just an example
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# plain strings for the domain-specific terms you want kept as single tokens
new_toks = ["tok1", "tok2", "tokN"]
# docs: https://huggingface.co/transformers/internal/tokenization_utils.html?highlight=add_tokens#transformers.tokenization_utils_base.SpecialTokensMixin.add_tokens
added_toks = tokenizer.add_tokens(new_toks)
# you need to resize the token embedding matrix to match the new vocab size
model.resize_token_embeddings(len(tokenizer))
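
note that resize_token_embeddings gives the newly added tokens randomly initialised embeddings, so you'd still need to fine-tune the LM on your corpus for them to become useful.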

you can find a more sophisticated approach described in this paper: https://www.aclweb.org/anthology/2020.findings-emnlp.129.pdf

hth!

Thanks a lot for the clarification. Apparently my domain is completely different from the original pretraining corpus. Adding tokens manually also doesn't seem feasible. I will go through the paper provided. Thank you once again.

If your domain vocabulary is totally different, then you probably do not want to adapt an existing pretrained LM to your domain-specific corpus. In this case, it probably makes more sense to train an LM from scratch.
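
If you do go the from-scratch route, one way to get a domain-specific vocabulary is to train a new tokenizer on your own corpus. A rough sketch with train_new_from_iterator (the base checkpoint, file path, and vocab size here are just illustrative placeholders):

from transformers import AutoTokenizer

# reuse the algorithm (e.g. BPE) of an existing fast tokenizer, but learn a fresh vocabulary
old_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def corpus_batches(path="domain_corpus.txt", batch_size=1000):
    # yield the domain corpus in batches of raw text lines
    with open(path) as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_batches(), vocab_size=30_000)
new_tokenizer.save_pretrained("domain-tokenizer")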

You can check how different your domain is from an existing pretrained vocabulary by running the corresponding pretrained tokenizer against a sample of your domain’s text and counting the number of unknown tokens that show up.
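
Here is a rough sketch of that check (the checkpoint name and sample file are placeholders you would swap for your own):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# read a sample of your domain text (path is a placeholder)
with open("domain_sample.txt") as f:
    text = f.read()

# tokenize without special tokens and count how many ids map to the unknown token
ids = tokenizer(text, add_special_tokens=False)["input_ids"]
n_unk = sum(1 for i in ids if i == tokenizer.unk_token_id)
print(f"{n_unk} unknown tokens out of {len(ids)} ({n_unk / len(ids):.2%})")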

In my experience, the pretrained tokenizers like BPE will not return any unknown tokens if your domain-specific corpus is in the same language (e.g., English).

Thank you for the response!
Okay. My corpus is in English itself, but very specific to a particular domain. Would it be worth trying MLM fine-tuning with a BPE tokenizer, maybe with RoBERTa?

“Vocabulary” as it is used in the context of tokenizers is not the same as the common English definition of “vocabulary”, i.e., a tokenizer doesn’t necessarily break up a text into intelligible English words. You may want to read up on how tokenizers work:
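
As a quick illustration (the checkpoint and the example word are arbitrary placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# a rare domain-specific word is split into several subword pieces
# rather than being kept as one word or mapped to an unknown token
print(tokenizer.tokenize("thrombocytopenia"))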