RoBERTa MLM fine-tuning


I want to fine-tune RoBERTa for MLM on a dataset of about 200k texts. The texts are reviews from online forums ranging from basic conversations to technical descriptions with a very specific vocabulary.

I have two questions regarding data preparation:

  • Can I simply use RobertaTokenizer.from_pretrained("roberta-base") even if the vocabulary of my fine-tuning corpus might differ significantly from the pre-training corpus? Or is there a way to “adjust” the tokenizer to the new data?
  • Each review comes with the title of the thread it has been posted in. From earlier experiments I know that concatenating titles and texts (and adding a special separator token between them) improves model performance for classification. However, I am wondering how this should be handled during language model fine-tuning? Since some threads contain hundreds of reviews, it seems wasteful for the language model to predict on the same title over and over again.

Hello there,

I am currently trying to do the same thing: fine-tune RoBERTa on a corpus with a very specific vocabulary (let's say biology terminology).

Regarding your first question, you should at least add the new domain-specific words to the tokenizer's vocabulary. See this discussion: how can i finetune BertTokenizer? · Issue #2691 · huggingface/transformers · GitHub

As for the MLM training itself, which class did you use exactly? I have been looking for more information online and found this (NLP-with-Deep-Learning/fine_tuning_bert_with_MLM.ipynb at master · yash-007/NLP-with-Deep-Learning · GitHub), but I wonder how it would translate to RoBERTa.