Domain adaptation for embeddings - fine-tuning on MLM

I would like to create better search functionality for a domain specific language (DSL).

For this, I’m trying to fine-tune an encoder on the masked language modeling (MLM) objective as described here: Fine-tuning a masked language model - Hugging Face NLP Course. This is similar to this question, except I’m leaving the tokenizer as is: Domain adaptation of Language Model and Tokenizer

I’ve tried two base models, all-MiniLM-L6-v2 and roberta-base, and fine-tuned them on about 30k samples from the DSL with 15% masking.
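For context, the MLM stage looks roughly like this; the corpus file name, max length, and training hyperparameters below are placeholders rather than my exact setup:

```python
# Minimal MLM fine-tuning sketch with Hugging Face Transformers.
# "dsl_corpus.txt" (one DSL snippet per line) and the hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "roberta-base"  # or "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load the raw DSL snippets and tokenize them (tokenizer left as is).
dataset = load_dataset("text", data_files={"train": "dsl_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)
splits = dataset.train_test_split(test_size=0.05, seed=42)

# 15% of tokens are masked, matching the setup described above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="dsl-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=collator,
)
trainer.train()
print(trainer.evaluate())  # eval MLM loss

# Save model and tokenizer so the encoder can be reloaded later.
trainer.save_model("dsl-mlm")
tokenizer.save_pretrained("dsl-mlm")
```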

To evaluate the results, I use a set of semantically equivalent/similar pairs and check how well I can retrieve one item from a line-up by encoding the other; similar in spirit to an InfoNCE loss, but with rankings instead of probabilities.
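The evaluation is roughly the following; the pairs here are dummies and the model path is just an example:

```python
# For each (query, positive) pair: encode both sides, rank all candidates by
# cosine similarity to the query, and record the rank of the true match.
import numpy as np
from sentence_transformers import SentenceTransformer

pairs = [
    ("query snippet 1", "equivalent snippet 1"),
    ("query snippet 2", "equivalent snippet 2"),
    # ...
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

queries = model.encode([q for q, _ in pairs], normalize_embeddings=True)
targets = model.encode([t for _, t in pairs], normalize_embeddings=True)

# Cosine similarity of every query against every candidate in the line-up.
sims = queries @ targets.T

ranks = []
for i in range(len(pairs)):
    order = np.argsort(-sims[i])
    ranks.append(int(np.where(order == i)[0][0]) + 1)  # 1 = retrieved first

print("MRR:", np.mean([1.0 / r for r in ranks]))
print("Recall@1:", np.mean([r == 1 for r in ranks]))
```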

With both models I’ve tried, I find that as the MLM loss decreases (on eval, not just train), the actual metrics I care about get worse. In other words, roberta-base is better at searching for similar DSL pairs than a model that’s actually seen the DSL.

I realize I should also train on an InfoNCE-style loss at some point (e.g. Losses — Sentence Transformers documentation), but there’s not enough data for that at the moment.
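For reference, once there is enough paired data, that stage could look something like this with Sentence Transformers’ MultipleNegativesRankingLoss (which uses in-batch negatives); the example pairs are placeholders:

```python
# Contrastive (InfoNCE-style) fine-tuning sketch with Sentence Transformers.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

train_examples = [
    InputExample(texts=["dsl snippet A", "equivalent snippet A"]),
    InputExample(texts=["dsl snippet B", "equivalent snippet B"]),
    # ...
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```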

Shouldn’t there already be some improvement from training on MLM?

Would appreciate any thoughts/pointers/references!

Found an answer here: MLM — Sentence Transformers documentation

Note: Only running MLM will not yield good sentence embeddings. But you can first tune your favorite transformer model with MLM on your domain specific data. Then you can fine-tune the model with the labeled data you have or using other data sets like NLI, Paraphrases, or STS.
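So the recipe is: run MLM on the DSL corpus first, then add a supervised/contrastive stage on top. Wrapping the MLM-adapted checkpoint into a SentenceTransformer for that second stage could look like this; the "dsl-mlm" path is just the output directory from the MLM sketch above, and mean pooling is an assumption:

```python
# Wrap the MLM-adapted encoder with a pooling layer to get sentence embeddings,
# ready for fine-tuning on labeled pairs, NLI, paraphrases, or STS.
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("dsl-mlm", max_seq_length=256)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

embeddings = model.encode(["example DSL snippet"])
print(embeddings.shape)
```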
