Ideas for better cross-corpus similarity scoring

I’m using SBERT to compute embeddings for SEC 10-K filings and for patent grants.

The 10-K filings contain mostly non-technological information, but a select few passages are relatively dense in technological content.

Currently, I use an SBERT model trained on patents to create extractive summaries of the 10-Ks, then use the same model to compute embeddings of those summaries. Finally, I compute cosine similarities between the summary embeddings and the SBERT embeddings of the patents.
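For concreteness, here is a minimal sketch of that pipeline. The model name `"patent-sbert"` is a placeholder for whatever patent-tuned checkpoint you use, the sentence splitter is deliberately naive, and I’m assuming the extraction step scores sentences against a patent-corpus centroid; substitute your actual selection criterion.

```python
# Sketch only: "patent-sbert" is a placeholder model name, and the
# sentence-level extraction criterion (centroid scoring) is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("patent-sbert")  # placeholder patent-tuned SBERT

def extractive_summary(filing_text: str, patent_centroid: np.ndarray, k: int = 20) -> str:
    """Keep the k sentences of a 10-K scoring highest against a patent-corpus centroid."""
    sentences = [s.strip() for s in filing_text.split(".") if s.strip()]  # naive splitter
    sent_emb = model.encode(sentences, convert_to_numpy=True, normalize_embeddings=True)
    scores = sent_emb @ patent_centroid  # cosine similarity (both sides unit-normalized)
    top = np.argsort(scores)[-k:]
    return ". ".join(sentences[i] for i in sorted(top))  # preserve document order

# Crude "technological language" probe: the centroid of the patent embeddings.
patents = ["...patent text 1...", "...patent text 2..."]  # stand-ins for the corpus
patent_emb = model.encode(patents, convert_to_numpy=True, normalize_embeddings=True)
centroid = patent_emb.mean(axis=0)
centroid /= np.linalg.norm(centroid)

# Summarize one filing, embed the summary, and score it against every patent.
summary = extractive_summary("...full 10-K text...", centroid)
summary_emb = model.encode(summary, convert_to_numpy=True, normalize_embeddings=True)
similarities = util.cos_sim(summary_emb, patent_emb)  # 1 x n_patents cosine scores
```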

But it isn’t obvious to me that an SBERT model trained on patents is optimal for extracting the 10-K passages with the highest density of technological language.

How can I do this better? Any thoughts and ideas are greatly appreciated!