We have a large amount of domain-specific data (200M+ documents, each ~100 to ~500 words long) and wanted to build a domain-specific LM.
We sampled 2M+ documents and fine-tuned RoBERTa-base on them using the Masked Language Modelling (MLM) objective.
So far (a rough sketch of this setup is shown below):
- trained for 4-5 epochs (sequence length 512, batch size 48)
- used a cosine learning-rate scheduler (2-3 cycles over the epochs)
- used dynamic masking (15% of tokens masked)
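This is a minimal sketch of the setup above, not our exact script; the corpus path, learning rate, and line-per-document data format are assumptions:

```python
from datasets import load_dataset
from transformers import (
    RobertaTokenizerFast,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Assumes one raw document per line in a text file (placeholder path).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: tokens are re-masked each time a batch is collated.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="roberta-domain-mlm",
    num_train_epochs=5,
    per_device_train_batch_size=48,
    lr_scheduler_type="cosine",    # "cosine_with_restarts" gives multiple cycles
    learning_rate=5e-5,            # assumption; not stated above
    save_strategy="epoch",
)

trainer = Trainer(
    model=model, args=args, train_dataset=tokenized, data_collator=collator
)
trainer.train()
```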
Since this RoBERTa model is fine-tuned on domain-specific data, we expected it to perform better than the pre-trained RoBERTa, which was trained on general text (Wikipedia, books, etc.).
We evaluated both the fine-tuned domain-specific RoBERTa and the pre-trained RoBERTa on several tasks: Named Entity Recognition (NER), text classification, and embedding generation for cosine-similarity comparisons (a rough sketch of the embedding comparison is below).
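A minimal sketch of the cosine-similarity comparison, assuming mean pooling over the last hidden state (our actual pooling strategy may differ); the checkpoint path "roberta-domain-mlm" and the example texts are placeholders:

```python
import torch
from transformers import AutoTokenizer, AutoModel

def embed(texts, model_name):
    """Return mean-pooled sentence embeddings for a list of texts."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state       # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)        # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)       # mean pooling

texts = ["first domain sentence ...", "second domain sentence ..."]
for name in ["roberta-base", "roberta-domain-mlm"]:
    a, b = embed(texts, name)
    sim = torch.nn.functional.cosine_similarity(a, b, dim=0)
    print(name, float(sim))
```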
Surprisingly, the results are essentially the same (only a very small difference) for both models. We also tried spaCy models, and the results were the same.
Perplexity scores indicate that the fine-tuned MLM-based RoBERTa has minimal loss on the domain data.
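For reference, here is how we read "perplexity": exp of the mean masked-LM cross-entropy loss on a held-out split. A minimal sketch, assuming the `trainer` from the snippet above and a tokenized held-out set named `eval_tokenized` (placeholder name):

```python
import math

# Trainer.evaluate returns the mean masked-LM loss as "eval_loss".
eval_metrics = trainer.evaluate(eval_dataset=eval_tokenized)
print("eval loss :", eval_metrics["eval_loss"])
print("perplexity:", math.exp(eval_metrics["eval_loss"]))
```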
Can anyone please help us understand why the MLM fine-tuned model is NOT performing better?
- should we use more data, more epochs, or both, to see an effect?
- are we doing anything wrong here? Let me know if any required details are missing; I will update the question.
Any suggestions or relevant links addressing these concerns would be really helpful.