So I have been working on a Discord bot that tries to detect "FUD" sentences in message history.
I first tried cosine similarity over sentence embeddings, which is quite fast, but as explained here: nlp - String comparison with BERT seems to ignore "not" in sentence - Stack Overflow, it is not the best fit for this application, which needs more semantic information than that embedding and metric can provide.
The code:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def test_entailment(text1, text2):
    batch = tokenizer(text1, text2, return_tensors='pt').to(model.device)
    with torch.no_grad():
        proba = torch.softmax(model(**batch).logits, -1)
    return proba.cpu().numpy()[0, model.config.label2id['ENTAILMENT']]

def test_equivalence(text1, text2):
    # Treat two sentences as equivalent only if each entails the other.
    return test_entailment(text1, text2) * test_entailment(text2, text1)

print(test_equivalence("I'm a good person", "I'm not a good person"))  # 2.0751484e-07
print(test_equivalence("I'm a good person", "You are a good person"))  # 0.49342492
print(test_equivalence("I'm a good person", "I'm not a bad person"))   # 0.94236994
It works fine, but it is quite slow. My corpus contains only 50 example FUD sentences that I want to compare user sentences against.
I have to go through thousands of messages, and done pairwise this adds up quickly (approx. 1000 messages × 3-4 sentence pieces × 50 corpus sentences ≈ 150,000-200,000 pairs).
So my question, more specifically: is there a way to pre-embed the corpus sentences and reuse those embeddings when I calculate the score? And is there a way to batch-compare a sentence from a user message against all corpus embeddings at once?
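To make the second half of the question concrete, here is an untested sketch (reusing the tokenizer and model from above) of what I mean by batching: pair one user sentence with every corpus sentence and score all pairs in a single forward pass. The function name and the padding/truncation settings are just my assumptions, not something from the Stack Overflow post.

def batch_entailment(user_sentence, corpus_sentences):
    # Pair the user sentence with each corpus sentence; padding makes them one batch.
    batch = tokenizer(
        [user_sentence] * len(corpus_sentences),
        corpus_sentences,
        padding=True,
        truncation=True,
        return_tensors='pt',
    ).to(model.device)
    with torch.no_grad():
        proba = torch.softmax(model(**batch).logits, -1)
    # One entailment probability per corpus sentence.
    return proba[:, model.config.label2id['ENTAILMENT']].cpu().numpy()

Is something like this the right direction, and can the corpus side somehow be precomputed on top of it?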
I can use an RTX 3090 for this task, so I have 24 GB of VRAM available.
Thank you in advance! I really appreciate the solution provided in the Stack Overflow post as well!