I want to use a pretrained Transformer for sentence similarity in a specific domain (automotive). I have a domain-specific ontology and want to match parameters from different XML data against the ontology entries.
From what I've learned, there are many ways to fine-tune such models, but I'm still not sure what the "best" procedure would be in my case:
a. Fine-tune it on unlabeled in-domain text data.
b. Fine-tune it on labeled sentence pairs, but not from the target domain (I lack data in this domain).
c. Take every entry of the ontology and do data augmentation with WordNet and similar tools, then manually label the new data and use the labeled in-domain data as fine-tuning input. But because of my lack of practical NLP experience, I'm not sure whether this could produce better results than standard BERT, and whether it could introduce a bias.
As you can tell, I'm new to NLP. Maybe someone here has faced similar challenges.
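To make the matching step I have in mind concrete: whichever fine-tuning route I take, the final step would be nearest-neighbour search under cosine similarity between the embedding of an XML parameter and the embeddings of the ontology entries. A minimal sketch (the 3-d vectors and the labels like "EngineSpeed" are made up; in practice they would come from a sentence encoder):

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match(param_vec, ontology_vecs, ontology_labels):
    # return the ontology entry whose embedding is closest to the parameter
    scores = [cosine_sim(param_vec, v) for v in ontology_vecs]
    best = int(np.argmax(scores))
    return ontology_labels[best], scores[best]

# toy 3-d "embeddings" standing in for real sentence embeddings
ontology_vecs = [np.array([1.0, 0.0, 0.0]),
                 np.array([0.0, 1.0, 0.0])]
labels = ["EngineSpeed", "OilTemperature"]  # hypothetical ontology entries
param = np.array([0.9, 0.1, 0.0])           # embedding of an XML parameter

label, score = match(param, ontology_vecs, labels)
print(label, score)
```

With real embeddings I would just replace the toy vectors with the encoder output for each ontology entry and each XML parameter string.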