I want to fine-tune RoBERTa for MLM on a dataset of about 200k texts. The texts are reviews from online forums ranging from basic conversations to technical descriptions with a very specific vocabulary.
I have two questions regarding data preparation:
- Can I simply use `RobertaTokenizer.from_pretrained("roberta-base")` even if the vocabulary of my fine-tuning corpus might differ significantly from the pre-training corpus? Or is there a way to "adjust" the tokenizer to the new data?
- Each review comes with the title of the thread it was posted in. From earlier experiments I know that concatenating titles and texts (with a special separator token between them) improves model performance for classification. However, I am wondering how this should be handled during language-model fine-tuning. Since some threads contain hundreds of reviews, it seems wasteful to have the language model predict the same title over and over again.