Fine-tuned MLM-based RoBERTa not improving performance

We have a large amount of domain-specific data (200M+ data points, each document ~100 to ~500 words) and want to build a domain-specific LM.

We took a sample of data points (2M+) and fine-tuned RoBERTa-base using the Masked Language Modelling (MLM) task.

So far:

  1. we trained for 4-5 epochs (sequence length 512, batch size 48)
  2. we used a cosine learning-rate scheduler (2-3 cycles over the epochs)
  3. we used dynamic masking (masking 15% of tokens)
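
For reference, the dynamic-masking step (item 3) can be sketched in plain Python: which 15% of tokens get masked is re-sampled on every pass, so the model sees different masks each epoch. This is only an illustrative sketch (token IDs and the helper name are hypothetical); in practice Hugging Face's `DataCollatorForLanguageModeling(mlm_probability=0.15)` handles this.

```python
import random

MASK_ID = 50264        # RoBERTa's <mask> token id
SPECIAL_IDS = {0, 2}   # <s> and </s>; never masked

def dynamic_mask(token_ids, mlm_prob=0.15, seed=None):
    """Return (masked_ids, labels): ~mlm_prob of maskable positions are
    replaced by <mask>; labels hold the original id there, -100 elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tid in token_ids:
        if tid not in SPECIAL_IDS and rng.random() < mlm_prob:
            masked.append(MASK_ID)
            labels.append(tid)    # predict the original token here
        else:
            masked.append(tid)
            labels.append(-100)   # position ignored by the loss
    return masked, labels
```

Calling this on the same sequence in different epochs (different seeds) yields different masked positions, which is what distinguishes dynamic from static masking.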

Since the RoBERTa model is fine-tuned on domain-specific data, we expected it to perform better than the pre-trained RoBERTa, which was trained on general text (Wikipedia, books, etc.).

We evaluated both the fine-tuned domain-specific RoBERTa and the pre-trained RoBERTa on several tasks: Named Entity Recognition (NER), text classification, and embedding generation for cosine-similarity comparisons.
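
For the embedding comparison, the metric itself is standard cosine similarity; a minimal sketch, assuming you already have fixed-length (e.g. mean-pooled) sentence embeddings from either model:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Since both models produce 768-dimensional embeddings, the comparison between them comes down entirely to how differently the two models place domain texts in that space.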

Surprisingly, the results are essentially the same (only a very small difference) for both models. We also tried spaCy models, and the results are again similar.

Perplexity scores indicate that the fine-tuned MLM RoBERTa achieves a low loss on the domain data.
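
One caveat on that: MLM (pseudo-)perplexity is just the exponential of the mean masked-token cross-entropy loss, so a low loss and a low perplexity are the same observation rather than independent evidence, and neither guarantees downstream gains. A one-liner, assuming the loss is a mean cross-entropy in nats:

```python
import math

def perplexity(mean_mlm_loss):
    """Pseudo-perplexity from the mean masked-LM cross-entropy loss (in nats)."""
    return math.exp(mean_mlm_loss)
```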

Can anyone please help us understand why the MLM fine-tuned model is NOT performing better?

  1. Should we go for more data, more epochs, or both, to see some effect?
  2. Are we doing anything wrong here? Let me know if any required details are missing and I will update.

Any suggestions or links addressing these concerns would be really helpful.


I'm not sure why they perform the same, but by looking at the false-positive (FP) samples for both models on the test set you might see a noticeable trade-off between generalization and overfitting.
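
A minimal sketch of that comparison (all prediction lists below are hypothetical): collect the false positives each model makes on the shared test set, then inspect where the two error sets differ.

```python
def false_positives(preds, golds, positive_label=1):
    """Indices where a model predicted the positive class but the gold label disagrees."""
    return {i for i, (p, g) in enumerate(zip(preds, golds))
            if p == positive_label and g != positive_label}

# Hypothetical predictions on a shared test set
golds       = [1, 0, 1, 0, 0]
base_preds  = [1, 1, 1, 0, 0]  # pre-trained RoBERTa
tuned_preds = [1, 0, 1, 1, 0]  # fine-tuned RoBERTa

fp_base  = false_positives(base_preds, golds)
fp_tuned = false_positives(tuned_preds, golds)
only_tuned = fp_tuned - fp_base  # errors introduced by fine-tuning
only_base  = fp_base - fp_tuned  # errors fixed by fine-tuning
```

If `only_tuned` is dominated by examples the base model handled via general-language cues, that would point at overfitting to the domain rather than a broken training setup.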

@phosseini: Could you offer some assistance here, please? Do you have any ideas or suggestions?