Domain adaptation for embeddings: fine-tuning with MLM

I would like to build better search over text written in a domain-specific language (DSL).

For this, I’m trying to fine-tune an encoder with the masked language modeling (MLM) objective, as described in "Fine-tuning a masked language model" from the Hugging Face NLP Course. This is similar to the question "Domain adaptation of Language Model and Tokenizer", except that I’m leaving the tokenizer as is.

I’ve tried two base models, all-MiniLM-L6-v2 and roberta-base, and fine-tuned each on about 30k samples from the DSL with 15% of tokens masked.
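For readers unfamiliar with what "15% masking" means in practice, here is a minimal pure-Python sketch of the usual BERT-style scheme: each position is selected with probability 0.15, and a selected position is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function name, `mask_id`, and `vocab_size` values are made up for illustration, and the sketch ignores special tokens, which a real data collator would normally skip.

```python
import random

IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for one sequence using BERT-style masking."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected; the loss ignores it
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Toy usage with hypothetical token ids.
tokens = list(range(1000, 1200))
inputs, labels = mask_tokens(tokens, mask_id=4, vocab_size=5000, rng=random.Random(0))
selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{selected} of {len(tokens)} positions selected for prediction")
```

Note that the objective only rewards predicting individual tokens from their context; nothing in it asks for a good single-vector representation of the whole input.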

To evaluate the results, I use a set of semantically equivalent or similar pairs: I encode one member of each pair and check how highly the other member ranks in a line-up of candidates. This is similar to the InfoNCE loss, but it uses rankings instead of probabilities.
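A rough sketch of this kind of ranking evaluation, in case it helps make the setup concrete. It assumes the embeddings are already computed and are plain lists of floats, that `queries[i]` is paired with `candidates[i]`, and that every other candidate acts as a distractor. The function names and the choice of mean reciprocal rank and recall@k are my own illustration, not necessarily the exact metrics used above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_match(query, candidates, match_index):
    """1-based rank of candidates[match_index] by similarity to query."""
    target = cosine(query, candidates[match_index])
    # Count candidates that score strictly higher than the true match.
    return 1 + sum(cosine(query, c) > target for c in candidates)


def evaluate(queries, candidates, k=1):
    """Assumes queries[i] pairs with candidates[i]; the rest are distractors."""
    ranks = [rank_of_match(q, candidates, i) for i, q in enumerate(queries)]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    recall_at_k = sum(r <= k for r in ranks) / len(ranks)
    return {"mrr": mrr, "recall_at_k": recall_at_k, "ranks": ranks}


# Toy 2-d "embeddings": the third pair is deliberately badly aligned.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[0.9, 0.1], [0.2, 0.8], [1.0, -1.0]]
result = evaluate(queries, candidates, k=1)
print(result)
```

Running the same function on embeddings from the base model and from each MLM checkpoint gives a like-for-like comparison on a fixed candidate pool.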

With both models, as the MLM loss decreases (on the eval set, not just the training set), the retrieval metrics I actually care about get worse. In other words, the off-the-shelf roberta-base is better at retrieving similar DSL pairs than a version of it that has actually been trained on the DSL.

I realize I should also train with an InfoNCE-style loss at some point (e.g. one of those listed on the "Losses" page of the Sentence Transformers documentation), but there’s not enough paired data for that at the moment.
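For reference, a minimal sketch of what an InfoNCE-style loss with in-batch negatives computes, written in plain Python on a precomputed similarity matrix. This is only an illustration of the formula, not the Sentence Transformers implementation; the function name and the temperature value are assumptions.

```python
import math


def info_nce(sim, temperature=0.05):
    """Mean InfoNCE loss for a square similarity matrix.

    sim[i][j] is the similarity between anchor i and candidate j; the
    diagonal holds the positive pairs, everything else is a negative.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)


# Toy similarity matrices: positives on the diagonal score high vs. low.
aligned = [[0.9, 0.1], [0.2, 0.8]]
misaligned = [[0.1, 0.9], [0.8, 0.2]]
print(info_nce(aligned), info_nce(misaligned))  # the first value is lower
```

Unlike MLM, this objective directly rewards placing a pair's two embeddings closer together than the other items in the batch, which is the same property the ranking evaluation measures.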

Shouldn’t there already be some improvement from training on MLM?

Would appreciate any thoughts/pointers/references!

Found an answer on the "MLM" page of the Sentence Transformers documentation:

Note: Only running MLM will not yield good sentence embeddings. But you can first tune your favorite transformer model with MLM on your domain specific data. Then you can fine-tune the model with the labeled data you have or using other data sets like NLI, Paraphrases, or STS.

