Hi, I am using the Transformers library for the first time.
My goal is to estimate how likely it is that one sentence follows another. For this I have a list of 100 sentences, and I want to compute the next-sentence score for every ordered pair of distinct sentences, i.e. each sentence against every other sentence except itself.
The code below takes a considerable amount of time to run. Is there any way to speed it up, for example by applying the tokenizer/model to the entire list at once instead of to one pair at a time? I am happy to receive any performance hints.
Many thanks,
SecondBrother
import torch
from tqdm import tqdm
from transformers import BertForNextSentencePrediction, BertTokenizer

def get_similarity_nsp_bert(top_x_results: list[str], comparing_results: list[str], model: BertForNextSentencePrediction, tokenizer: BertTokenizer) -> list[tuple[str, str]]:
    matched_results = {}
    for i, first_sent in enumerate(tqdm(top_x_results)):
        similarities = []
        for j, second_sent in enumerate(comparing_results):
            if i != j:
                # One tokenizer call and one forward pass per pair -- this is the slow part.
                encoding = tokenizer(first_sent, second_sent, return_tensors='pt')
                outputs = model(**encoding)
                # logits[0][0] is the "is next sentence" logit; no labels needed since the loss is unused.
                similarities.append((j, outputs.logits[0][0].item()))
        # Sort the candidate second sentences by descending next-sentence logit.
        sorted_similarities = sorted(similarities, key=lambda x: x[1], reverse=True)
        matched_results[i] = sorted_similarities
    return [(top_x_results[i], comparing_results[matched_results[i][0][0]]) for i in matched_results]
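
For concreteness, here is an untested sketch of the batched version I have in mind. I am assuming top_x_results and comparing_results are in fact the same list of 100 sentences (so I pass a single sentences list), and batch_size is a parameter I made up to trade speed against memory; the function name get_similarity_nsp_bert_batched is mine as well:

import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

def get_similarity_nsp_bert_batched(sentences: list[str], model: BertForNextSentencePrediction, tokenizer: BertTokenizer, batch_size: int = 32) -> list[tuple[str, str]]:
    # Build every ordered pair of distinct sentences up front.
    pairs = [(i, j) for i in range(len(sentences)) for j in range(len(sentences)) if i != j]
    firsts = [sentences[i] for i, _ in pairs]
    seconds = [sentences[j] for _, j in pairs]

    scores = []
    model.eval()
    with torch.no_grad():  # inference only, so skip gradient bookkeeping
        for start in range(0, len(pairs), batch_size):
            # Tokenize a whole batch of pairs at once, padding to the longest pair in the batch.
            encoding = tokenizer(firsts[start:start + batch_size], seconds[start:start + batch_size], return_tensors='pt', padding=True, truncation=True)
            logits = model(**encoding).logits
            # Column 0 holds the "is next sentence" logit for each pair in the batch.
            scores.extend(logits[:, 0].tolist())

    # For each first sentence, keep the second sentence with the highest logit.
    best: dict[int, tuple[int, float]] = {}
    for (i, j), score in zip(pairs, scores):
        if i not in best or score > best[i][1]:
            best[i] = (j, score)
    return [(sentences[i], sentences[best[i][0]]) for i in sorted(best)]

Does this look like the right direction? I would expect the batched tokenization plus torch.no_grad() to help noticeably, but I have not benchmarked it yet.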