Test if a sentence is different from the training data

Is there a way to use SBERT or another model to flag whether new training data is similar to existing training data?

I am trying to use a model to determine whether new sentences intended for training are similar to the existing training data or clearly different from it.

For example, suppose I have a corpus of strings that all talk about walls and bricks. A new sentence about adding another brick in the wall should not set off the flag, but one about painting it black should, because it isn't similar to anything in the training corpus.
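
To make the question concrete, here is a rough sketch of the kind of check I have in mind: embed every sentence, take the highest cosine similarity between the new sentence and the corpus, and flag the sentence when that maximum falls below a threshold. Everything in it is an assumption for illustration. `embed` is a toy bag-of-words stand-in so the snippet runs without downloading a model (in practice I would expect to swap in real sentence embeddings, e.g. from SBERT), the names `max_similarity` and `is_novel` are made up, and the threshold of 0.5 is arbitrary and would need tuning.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts.

    Assumption: a real setup would return a dense vector from a model such
    as SBERT instead; this only exists to keep the sketch self-contained.
    """
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_similarity(sentence, corpus):
    """Similarity between the sentence and its closest corpus entry."""
    query = embed(sentence)
    return max((cosine(query, embed(s)) for s in corpus), default=0.0)


def is_novel(sentence, corpus, threshold=0.5):
    """Flag the sentence if nothing in the corpus is similar enough.

    The threshold is a hypothetical value, not a recommended default.
    """
    return max_similarity(sentence, corpus) < threshold


corpus = [
    "another brick in the wall",
    "the wall is built from red bricks",
    "lay each brick on the wall with mortar",
]

for candidate in ["add another brick in the wall", "paint it black"]:
    print(candidate, "->", "novel" if is_novel(candidate, corpus) else "similar")
```

With the toy embedding, the brick sentence shares almost all of its words with the first corpus entry and is not flagged, while "paint it black" shares none and is. What I am unsure about is whether this nearest-neighbour approach holds up with real embeddings and how to choose the threshold.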
