I have an unlabeled data dump of 2 million sentences.
I have fine-tuned a roberta-base model with masked language modeling on the first 100k sentences using the implementation described by Lewis Tunstall in this notebook.
Now I want to fine-tune roberta-base on 250k sentences to compare the effect of training on more data on a downstream binary classification task, but I have limited compute (Google Colab).
I want to know whether the two approaches below result in the same model:
- Fine-tune a roberta-base model with masked language modeling on the first 250k sentences from scratch.
- Fine-tune the model already trained on the first 100k sentences on the next 150k sentences with masked language modeling.
In theory, I believe both should give me the exact same final model, but I want to verify that.
The second approach would also save me a large share of the compute, since it only has to process 150k sentences instead of 250k.
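For concreteness, here is a minimal sketch of the two runs I have in mind, using the Hugging Face `Trainer` for masked language modeling. This is not the exact pipeline from the notebook; the file name `sentences.txt`, the `mlm-100k` checkpoint directory, and the hyperparameters are placeholders for my actual setup.

```python
# Sketch only: assumes the 2M sentences sit in a plain-text file, one sentence per line.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Load and tokenize the raw dump ("sentences.txt" is a placeholder for my file).
raw = load_dataset("text", data_files={"train": "sentences.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM collator: randomly masks 15% of tokens on the fly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

def train_mlm(model_name_or_path, dataset, output_dir):
    """Run one MLM fine-tuning pass of `dataset` starting from the given checkpoint."""
    model = AutoModelForMaskedLM.from_pretrained(model_name_or_path)
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=32,  # placeholder; whatever fits on the Colab GPU
        num_train_epochs=1,
        fp16=True,                       # assumes a GPU runtime
        save_strategy="epoch",
    )
    Trainer(
        model=model,
        args=args,
        data_collator=collator,
        train_dataset=dataset,
    ).train()
    return output_dir

# Approach 1: roberta-base fine-tuned on the first 250k sentences in one run.
train_mlm("roberta-base", tokenized.select(range(250_000)), "mlm-250k-scratch")

# Approach 2: the existing 100k checkpoint ("mlm-100k" is assumed to be its output
# directory), continued on the next 150k sentences.
train_mlm("mlm-100k", tokenized.select(range(100_000, 250_000)), "mlm-100k-then-150k")
```

Both runs end up having seen the same 250k sentences once; my question is whether that is enough for the resulting weights to be equivalent.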