So I have been working on a Discord bot that tries to detect "FUD" sentences in message history.
I first tried cosine similarity over sentence embeddings, which is quite fast, but as explained here: nlp - String comparison with BERT seems to ignore "not" in sentence - Stack Overflow, it is not the best fit for this application, which needs more semantic information than that embedding and metric can provide.
The code:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def test_entailment(text1, text2):
    batch = tokenizer(text1, text2, return_tensors='pt').to(model.device)
    with torch.no_grad():
        proba = torch.softmax(model(**batch).logits, -1)
    return proba.cpu().numpy()[0, model.config.label2id['ENTAILMENT']]

def test_equivalence(text1, text2):
    # Treat two sentences as equivalent only if each entails the other.
    return test_entailment(text1, text2) * test_entailment(text2, text1)

print(test_equivalence("I'm a good person", "I'm not a good person"))  # 2.0751484e-07
print(test_equivalence("I'm a good person", "You are a good person"))  # 0.49342492
print(test_equivalence("I'm a good person", "I'm not a bad person"))   # 0.94236994
It works fine, but it is quite slow. My corpus contains only 50 example FUD sentences that I want to compare user sentences against.
I have to go through thousands of messages, and done pairwise this adds up quickly (approx. 1000 messages × 3-4 sentence pieces × 50 corpus sentences ≈ 150,000-200,000 pairs).
So my question, more specifically: is there a way to pre-embed the corpus sentences and reuse those embeddings when I calculate the score? And is there a way to batch-compare a sentence from a user message against all corpus embeddings at once?
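To make the second half of the question concrete, here is an untested sketch (reusing the tokenizer and model from above) of what I mean by batching: pair one user sentence with every corpus sentence and score all pairs in a single forward pass. The function name and the padding/truncation settings are just my assumptions, not something from the Stack Overflow post.

def batch_entailment(user_sentence, corpus_sentences):
    # Pair the user sentence with each corpus sentence; padding makes them one batch.
    batch = tokenizer(
        [user_sentence] * len(corpus_sentences),
        corpus_sentences,
        padding=True,
        truncation=True,
        return_tensors='pt',
    ).to(model.device)
    with torch.no_grad():
        proba = torch.softmax(model(**batch).logits, -1)
    # One entailment probability per corpus sentence.
    return proba[:, model.config.label2id['ENTAILMENT']].cpu().numpy()

Is something like this the right direction, and can the corpus side somehow be precomputed on top of it?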
I can use an RTX 3090 for this task, so I have 24 GB of VRAM available.
Thank you in advance! I really appreciate the solution provided in the Stack Overflow post as well!