Hi all,
I’m trying to use BERT (or any language embedding model) to solve a semantic text similarity problem: given a product A, find a product B that is basically the same underlying product, with a few key differences. For example, “ABC green T-shirt” matches “ABC green T-shirt (2-count)”; however, “ABC green T-shirt” does NOT match “ABC red T-shirt (2-count)”. So my goal is to fine-tune BERT to pay more attention to, in this particular case, color, while not losing sight of the more important information: that both are T-shirts.
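To make the desired behavior concrete, here is a toy sketch of the match / non-match decision via cosine similarity. The 3-d vectors and the threshold are made up for illustration; a real sentence embedding model would produce ~768-d vectors:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical embeddings for the three product titles.
green_tee        = [0.90, 0.10, 0.30]  # "ABC green T-shirt"
green_tee_2count = [0.88, 0.12, 0.32]  # "ABC green T-shirt (2-count)"
red_tee_2count   = [0.20, 0.90, 0.30]  # "ABC red T-shirt (2-count)"

THRESHOLD = 0.9  # made-up decision threshold

is_match_pack  = cosine(green_tee, green_tee_2count) > THRESHOLD  # True: same product
is_match_color = cosine(green_tee, red_tee_2count) > THRESHOLD    # False: color differs
```

The fine-tuning goal, then, is an embedding space where pack-size variants land above the threshold and color variants land below it.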
What I’m doing now follows Natural Language Inference — Sentence-Transformers documentation:
- Train BERT with correct pairs, such as “ABC green T-shirt” matched with “ABC green T-shirt (2-count)”. There are about 150k training instances.
- Train the model with triplets, such as (“ABC green T-shirt”, “ABC green T-shirt (2-count)”, “ABC red T-shirt (2-count)”). There are about 5k training instances, so overfitting can be a problem here.
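For reference, the triplet objective in the second step pulls the anchor toward the positive and pushes it away from the negative. A minimal plain-Python sketch of the loss (toy 2-d embeddings and a default margin, not the actual Sentence-Transformers implementation):

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin): zero once the positive
    is at least `margin` closer to the anchor than the negative."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical embeddings for the three titles in the triplet above.
anchor   = [1.0, 0.0]  # "ABC green T-shirt"
positive = [0.9, 0.1]  # "ABC green T-shirt (2-count)"
negative = [0.0, 1.0]  # "ABC red T-shirt (2-count)"

good_loss = triplet_loss(anchor, positive, negative)  # positive already closer: loss is 0
bad_loss  = triplet_loss(anchor, negative, positive)  # roles swapped: loss is positive
```

Training minimizes this over all triplets, which is exactly what should teach the model that color matters more than pack size for your matching task.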
After these steps, I saw a slight improvement in matching accuracy. So my questions are:
- Am I headed in the right direction?
- What are the more up-to-date fine-tuning methods compared to the ones in the Sentence-Transformers documentation?
Thank you very much!