I fine-tuned a transformer for classification to compute similarity between names. This is a toy example:
name0 name1 label
Test Test y
Test Hi n
I fine-tuned the model on these labels, feeding it pairs of names, since the tokenizer accepts two pieces of text as input.
I found some really weird behavior. At prediction time, some pairs have a very high chance of being predicted as similar just because they contain repeated words. For example,
name0 name1 label
Hi Hi Hi dsfds ?
has a high chance of being predicted as y!
In general, there are some names that, no matter what you pair them with, cause the pair to be predicted as y.
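One way to check this systematically is to pair every name with every other name and measure how often each one ends up in a pair predicted as y. The snippet below is only a minimal sketch: `always_similar_rate` is a hypothetical helper, and `toy_predict` is a made-up stand-in for the fine-tuned model's prediction call that simply mimics the failure mode, not the real model.

```python
from collections import defaultdict

def always_similar_rate(names, predict_pair):
    """For each name, return the fraction of partners it is predicted 'y' with.

    `predict_pair(name0, name1)` is assumed to return the label 'y' or 'n'.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for a in names:
        for b in names:
            if a == b:
                continue  # skip identical pairs, which should be 'y' anyway
            totals[a] += 1
            if predict_pair(a, b) == "y":
                hits[a] += 1
    return {n: hits[n] / totals[n] for n in names if totals[n]}

# Stand-in predictor for illustration only (NOT the real model):
# any pair that involves "Hi Hi Hi" is called similar.
def toy_predict(name0, name1):
    return "y" if "Hi Hi Hi" in (name0, name1) or name0 == name1 else "n"

rates = always_similar_rate(["Test", "Hi Hi Hi", "dsfds"], toy_predict)
# Names that are predicted 'y' with every partner are the suspicious ones.
suspicious = [name for name, rate in rates.items() if rate == 1.0]
print(suspicious)
```

Swapping `toy_predict` for a function that runs the actual model on a pair would give a list of the names that trigger the problem.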
Has anyone else noticed this behavior?
At the moment, I am trying to augment my data with:
- Empty names
- Random characters (always the same string)
E.g.
name0 name1 label
Test (empty) n
(empty) Test n
Test dsfsd n
dsfsd Test n
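The augmentation described above can be sketched roughly as follows. This is an illustrative sketch only: the helper name `augment_negatives` and its `noise` parameter are hypothetical, and it assumes each real name is paired once with an empty string and once with the same fixed junk string, in both positions, always with label n.

```python
def augment_negatives(names, noise="dsfsd"):
    """Build extra (name0, name1, label) rows, all labeled 'n'.

    Each name is paired with an empty string and with a fixed junk
    string (`noise`), once on each side of the pair.
    """
    rows = []
    for name in names:
        for filler in ("", noise):
            rows.append((name, filler, "n"))
            rows.append((filler, name, "n"))
    return rows

extra = augment_negatives(["Test"])
for row in extra:
    print(row)
```

For the single name `Test`, this produces the same four rows as the example above, with the empty name represented as an empty string.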
Unfortunately, I still see the same behavior.