Hi, this is my first time trying to build semantic search. In my case it's asymmetric semantic search, and I am using msmarco-MiniLM-L-6-v3 to create embeddings. I have about 400k elements to embed.
The problem: my input text is way longer than the 512-token limit. The texts are on average 3,000 words (so even more tokens). From what I understand, these could be solutions:
1. Chunking the text (with overlap?), embedding each chunk, and then doing mean pooling on the resulting vectors? The problem is I have no idea how to do this. I am a beginner and this is my first time doing this, so is there any code example/tutorial/steps to follow?
2. Running a summarizer model like facebook/bart-large-cnn to condense the text to 512 tokens or fewer. But I feel this would cost more in computation/time and would lose context and granularity(?)
Any tips are appreciated. This is really confusing to me, especially because the only way to know you did it wrong is to test the results.
You can follow the approach below for a chunking strategy -
import numpy as np
from FlagEmbedding import FlagModel
from transformers import AutoTokenizer


class EmbeddingModel:
    _instance = None

    def __new__(cls):
        # Singleton: load the model and tokenizer only once.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.model = FlagModel('BAAI/bge-large-en-v1.5', use_fp16=True)  # Replace with your model
            cls._instance.tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5')  # Replace with your model's tokenizer
            cls._instance.max_tokens = 512  # Your model's max token limit
            cls._instance.overlap = 50  # Number of tokens shared between consecutive chunks
        return cls._instance

    def get_embeddings(self, text):
        total_tokens = self.get_tokens_count(text)
        if total_tokens <= self.max_tokens:
            return self.get_sentence_embedding(text)
        else:
            return self.get_chunked_embeddings(text).tolist()

    def get_tokens_count(self, text):
        # add_special_tokens=False so the count covers the text only, not [CLS]/[SEP].
        all_tokens = self.tokenizer(text, padding=False, truncation=False,
                                    add_special_tokens=False, return_tensors=None)['input_ids']
        return len(all_tokens)

    def get_sentence_embedding(self, text):
        return self.model.encode(text).tolist()

    def get_chunked_embeddings(self, text):
        chunks = self.chunk_text(text)
        embeddings = []
        for chunk in chunks:
            embeddings.append(np.array(self.get_sentence_embedding(chunk)))
        # Mean-pool the chunk embeddings into one document vector.
        return np.mean(embeddings, axis=0)

    def chunk_text(self, text):
        tokens = self.tokenizer(text, padding=False, truncation=False,
                                add_special_tokens=False, return_tensors=None)['input_ids']
        chunks = []
        # Slide a window of max_tokens, stepping forward max_tokens - overlap each time.
        for i in range(0, len(tokens), self.max_tokens - self.overlap):
            chunk = tokens[i:i + self.max_tokens]
            chunks.append(self.tokenizer.decode(chunk, skip_special_tokens=True))
            if len(chunk) < self.max_tokens:
                break
        return chunks
Later you can use the embedding model like this -
import numpy as np
input_text = "I need the embedding of this text."
embedding_model = EmbeddingModel()
embedding = np.array(embedding_model.get_embeddings(input_text))
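Since you mentioned sentence-transformers' msmarco-MiniLM-L-6-v3 rather than a BGE model, here is a rough sketch of the same chunk-and-mean-pool idea with that library, plus how the resulting vectors could be used for the asymmetric search itself. The embed_long_text helper and the cosine scoring are assumptions for illustration - check the model card for the scoring function your model was trained with.

import numpy as np
from sentence_transformers import SentenceTransformer, util

# Assumption: the sentence-transformers model from the question.
model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L-6-v3')
tokenizer = model.tokenizer
max_tokens = model.max_seq_length   # 512 according to the question
overlap = 50

def embed_long_text(text):
    # Tokenize without special tokens so each decoded chunk re-encodes cleanly.
    ids = tokenizer(text, add_special_tokens=False, truncation=False)['input_ids']
    if len(ids) <= max_tokens:
        return model.encode(text)
    step = max_tokens - overlap
    chunks = [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), step)]
    chunk_embs = model.encode(chunks)   # shape: (n_chunks, dim)
    return chunk_embs.mean(axis=0)      # mean pooling over the chunk vectors

# Build the corpus index once (for 400k documents, batch this and save to disk).
docs = ["first long document ...", "second long document ..."]
doc_embs = np.vstack([embed_long_text(d) for d in docs])

# Asymmetric search: a short query against the long-document embeddings.
query_emb = model.encode("my short search query")
scores = util.cos_sim(query_emb, doc_embs)   # 1 x n_docs similarity matrix
best = int(scores.argmax())
print(docs[best], float(scores[0][best]))

Keep in mind that mean pooling over many chunks tends to blur the document vector, so if retrieval quality is poor it is often better to store one embedding per chunk and return the document of the best-matching chunk instead.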