Transformer similarity model (fine-tuned for classification) too sensitive to repeated words

I fine-tuned a transformer for classification to compute similarity between names. This is a toy example:

 name0 name1 label
 Test  Test  y
 Test  Hi    n

I fine-tuned the model on these labels, feeding it pairs of names, since the tokenizer allows passing two pieces of text at once.
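To make the "two pieces of text" input concrete, here is a minimal sketch of how a BERT-style tokenizer lays out a text pair in a single sequence (the function name and whitespace tokenization are my own simplification; real tokenizers like Hugging Face's also handle subwords, padding, and attention masks):

```python
def encode_pair(name0: str, name1: str):
    """Build the [CLS] name0 [SEP] name1 [SEP] layout used for text pairs.

    token_type_ids mark which segment each token belongs to: 0 for the
    first name (including [CLS] and the first [SEP]), 1 for the second.
    """
    tokens = ["[CLS]"] + name0.split() + ["[SEP]"] + name1.split() + ["[SEP]"]
    first_sep = tokens.index("[SEP]")
    type_ids = [0] * (first_sep + 1) + [1] * (len(tokens) - first_sep - 1)
    return tokens, type_ids

tokens, type_ids = encode_pair("Test", "Hi")
# tokens   → ['[CLS]', 'Test', '[SEP]', 'Hi', '[SEP]']
# type_ids → [0, 0, 0, 1, 1]
```

The classification head then sees one joint sequence, so token overlap between the two names can dominate the prediction, which may be related to the behavior described below.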

I found some really weird behavior. At prediction time, there are pairs that have a very high chance of being predicted as similar simply because they contain repeated words. For example,

   name0        name1       label
   Hi Hi Hi     dsfds       ?

has a high chance of being predicted as y!

In general, there are some names that, no matter what you pair them with, get predicted as y.

Did anyone notice this behavior?

At the moment, I am trying to augment my data with:

  • Empty names
  • A random character string (always the same one)


 name0 name1 label
 Test        n
       Test  n
 Test  dsfsd n
 dsfsd Test  n
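The augmentation rows above can be generated programmatically. A minimal sketch (the helper name and the fixed junk string are my own; adapt to your actual data pipeline):

```python
def augment_negatives(names, junk="dsfsd"):
    """For each name, emit negative (name0, name1, label) rows against the
    empty string and a fixed junk string, in both orders, mirroring the
    augmentation table above."""
    rows = []
    for name in names:
        rows += [
            (name, "", "n"),    # empty name on the right
            ("", name, "n"),    # empty name on the left
            (name, junk, "n"),  # junk string on the right
            (junk, name, "n"),  # junk string on the left
        ]
    return rows

augment_negatives(["Test"])
# → [('Test', '', 'n'), ('', 'Test', 'n'),
#    ('Test', 'dsfsd', 'n'), ('dsfsd', 'Test', 'n')]
```

Note that a single fixed junk string only teaches the model about that one string; sampling fresh random strings per example would likely generalize better.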

Unfortunately, I still see the same behavior.

I’m not directly answering your question, but have you tried adding some negative examples of these cases to your training set?
For example, adding tuples such as (‘Test’, ‘’, n), (‘dsfs’, ‘Test’, n), and others like them.

At least for the empty names, the model should learn this!

Yes, it actually helps a bit. Though I still sometimes see strange behavior with random character strings. I suspect I am fine-tuning on too little data.