Refine BERT to pay more attention to key words

Hi all,

I’m trying to use BERT (or any other language embedding model) to solve a semantic text similarity problem: given a product A, find a product B that is basically the same underlying product, with a few key differences. For example, “ABC green T-shirt” matches “ABC green T-shirt (2-count)”; however, “ABC green T-shirt” does NOT match “ABC red T-shirt (2-count)”. So my goal is to fine-tune BERT to pay more attention to, in this case, color, while not losing sight of the more important information (that both products are T-shirts).

What I’m doing now is following the Natural Language Inference — Sentence-Transformers documentation:

  1. train BERT on positive pairs, such as “ABC green T-shirt” matched with “ABC green T-shirt (2-count)”. There are about 150k training instances (see the first sketch after this list).

  2. train the model from step 1 on triplets, such as (“ABC green T-shirt”, “ABC green T-shirt (2-count)”, “ABC red T-shirt (2-count)”). There are about 5k training instances, so overfitting can be a problem here (see the second sketch after this list).
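
For concreteness, here is roughly what my step 1 looks like with the Sentence-Transformers fit API. The checkpoint name, batch size, and other hyperparameters below are illustrative, not exactly what I ran:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative checkpoint; any BERT-like encoder works here.
model = SentenceTransformer("bert-base-uncased")

# Positive pairs only. MultipleNegativesRankingLoss treats the other examples
# in the same batch as negatives, so no explicit negatives are required.
train_examples = [
    InputExample(texts=["ABC green T-shirt", "ABC green T-shirt (2-count)"]),
    # ... the rest of the ~150k pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
)
```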
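And step 2, continuing with the pair-trained model from the snippet above (again, the hyperparameters are illustrative):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

# "model" is the pair-trained encoder from the previous snippet.
# Each triplet is (anchor, positive, hard negative); the negative differs only
# in the attribute the model should learn to weight (color, in this example).
triplet_examples = [
    InputExample(texts=[
        "ABC green T-shirt",            # anchor
        "ABC green T-shirt (2-count)",  # positive
        "ABC red T-shirt (2-count)",    # hard negative
    ]),
    # ... the rest of the ~5k triplets
]
triplet_dataloader = DataLoader(triplet_examples, shuffle=True, batch_size=32)
triplet_loss = losses.TripletLoss(model)

# With only ~5k triplets I keep this stage short (one epoch, default margin)
# to limit overfitting.
model.fit(
    train_objectives=[(triplet_dataloader, triplet_loss)],
    epochs=1,
    warmup_steps=100,
)
```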

After these steps, I saw a slight improvement in matching accuracy. So my questions are:

  1. am I going in the right direction?

  2. what are more up-to-date fine-tuning methods compared to the ones in the Sentence-Transformers documentation?

Thank you very much!