LM fine-tuning on an unlabelled dataset

Hello Team,

Can you please tell me how to fine-tune a (any) masked language model (MLM) on a domain-specific corpus? I am following this link obtained from the Hugging Face documentation. Is this the procedure I should be following? If so, how will it update the vocabulary to adapt to the new tokens in my domain-specific corpus?

Thanks in advance.
