Text input bigger than the max token length for semantic search embeddings

Hi, this is my first time trying to build semantic search. In my case it's asymmetric semantic search, and I am using msmarco-MiniLM-L-6-v3 to create the embeddings. I have about 400k elements to embed.

The problem: my input text is way longer than the 512-token limit. The texts are on average 3,000 words (so even more tokens). From what I understand, these could be solutions:

  • Chunking the text (with overlap?), embedding each chunk, and then mean pooling the resulting vectors? The problem is I have no idea how to do this. I am a beginner and this is my first time doing this, so is there any code example/tutorial/steps to follow?

  • Running a summarizer model like facebook/bart-large-cnn to condense the text down to 512 tokens or fewer (roughly what I have in mind is sketched below this list). But I feel this would cost more in computation/time and would lose context and granularity(?)
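
For the second option, this is roughly what I imagine, using the transformers summarization pipeline - just a sketch, I haven't tested it and the generation parameters are guesses:

from transformers import pipeline

# Rough sketch of the summarizer idea (untested, parameters are guesses).
# Note that facebook/bart-large-cnn itself only accepts about 1024 input tokens,
# so a 3000-word document would still need to be split or truncated first.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

long_text = "..."  # one of the long documents
summary = summarizer(long_text, max_length=300, min_length=100, do_sample=False, truncation=True)[0]["summary_text"]
# `summary` should then be short enough to embed directly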

Any tips are appreciated. This is really confusing to me, especially because the only way to know you did it wrong is to test the results.

You can follow the approach below for the chunking strategy -

import numpy as np
from FlagEmbedding import FlagModel
from transformers import AutoTokenizer

class EmbeddingModel:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.model = FlagModel('BAAI/bge-large-en-v1.5', use_fp16=True) # Replace with your model
            cls._instance.tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5') # Replace with your model's tokenizer
            cls._instance.max_tokens = 512 # Your model's max token limit
            cls._instance.overlap = 50 # number of tokens overlapped between consecutive chunks
        return cls._instance
    
    def get_embeddings(self, text):
        # Short texts are embedded directly; longer texts are chunked and mean-pooled
        total_tokens = self.get_tokens_count(text)
        if total_tokens <= self.max_tokens:
            return self.get_sentence_embedding(text)
        else:
            return self.get_chunked_embeddings(text).tolist()
    
    def get_tokens_count(self, text):
        all_tokens = self.tokenizer(text, padding=False, truncation=False, return_tensors=None)['input_ids']
        return len(all_tokens)

    def get_sentence_embedding(self, text):
        embedding = self.model.encode(text).tolist()
        return embedding
    
    def get_chunked_embeddings(self, text):
        # Embed each chunk separately, then mean-pool the chunk vectors into one embedding
        chunks = self.chunk_text(text)
        embeddings = []
        for chunk in chunks:
            embedding = np.array(self.get_sentence_embedding(chunk))
            embeddings.append(embedding)
        return np.mean(embeddings, axis=0)

    def chunk_text(self, text):
        # Split the token ids into overlapping windows and decode each window back to text
        tokens = self.tokenizer(text, padding=False, truncation=False, return_tensors=None)['input_ids']
        chunks = []
        for i in range(0, len(tokens), self.max_tokens - self.overlap):
            chunk = tokens[i:i + self.max_tokens]
            chunks.append(self.tokenizer.decode(chunk, skip_special_tokens=True))
            if len(chunk) < self.max_tokens:
                break
        return chunks
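
The example above uses BAAI/bge-large-en-v1.5 via FlagEmbedding, but the same class works with the model from your question; a rough, untested sketch of the swap (assuming the sentence-transformers repo name):

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

# Untested sketch: load the model from the question instead of FlagModel.
# Swap these lines into __new__ above; model.encode() also returns a
# numpy array, so the rest of the class stays the same.
model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L-6-v3')
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/msmarco-MiniLM-L-6-v3')
print(model.max_seq_length)  # confirm the real token limit before hard-coding max_tokens = 512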

Later you can use the embedding model like this -

import numpy as np

input_text = "I need the embedding of this text."
embedding_model = EmbeddingModel()
embedding = np.array(embedding_model.get_embeddings(input_text))
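
For the asymmetric search itself, you then embed the (short) query the same way and score it against your stored document vectors. A minimal sketch with cosine similarity over a plain numpy matrix (for 400k documents you may eventually want FAISS or a vector database; also check your model card for whether it was tuned for cosine or dot-product scoring):

# Minimal sketch: score a query against precomputed document embeddings.
# `doc_embeddings` is assumed to be a (num_docs, dim) numpy array built by
# calling embedding_model.get_embeddings on each of your texts.
def search(query, doc_embeddings, top_k=5):
    q = np.array(embedding_model.get_embeddings(query))
    # cosine similarity = dot product of L2-normalized vectors
    q = q / np.linalg.norm(q)
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = docs @ q
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]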