Text input bigger than the max token length for semantic search embeddings

Hi, this is my first time trying to build semantic search. In my case it's asymmetric semantic search, and I am using msmarco-MiniLM-L-6-v3 to create the embeddings. I have about 400k elements to embed.

The problem: my input text is way longer than the 512-token limit. The texts are on average 3,000 words (so even more tokens). From what I understand, these could be solutions:

  • Chunking the text (with overlap?), embedding each chunk, and then mean pooling the resulting vectors? The problem is I have no idea how to do this. I am a beginner and this is my first time doing this, so is there any code example/tutorial/steps to follow?

  • Running a summarizer model like facebook/bart-large-cnn to condense the text down to 512 tokens or fewer (roughly what I have in mind is sketched below this list). But I feel this would cost more in computation/time and would lose context and granularity(?)
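
For the second option, this is roughly what I imagine, using the transformers summarization pipeline - just a sketch, I haven't tested it and the generation parameters are guesses:

from transformers import pipeline

# Rough sketch of the summarizer idea (untested, parameters are guesses).
# Note that facebook/bart-large-cnn itself only accepts about 1024 input tokens,
# so a 3000-word document would still need to be split or truncated first.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

long_text = "..."  # one of the long documents
summary = summarizer(long_text, max_length=300, min_length=100, do_sample=False, truncation=True)[0]["summary_text"]
# `summary` should then be short enough to embed directly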

Any tips are appreciated. This is really confusing to me, especially because the only way to know you did it wrong is to test the results.

You can follow the approach below for the chunking strategy -

import numpy as np
from FlagEmbedding import FlagModel
from transformers import AutoTokenizer

class EmbeddingModel:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.model = FlagModel('BAAI/bge-large-en-v1.5', use_fp16=True) # Replace with your model
            cls._instance.tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5') # Replace with your model's tokenizer
            cls._instance.max_tokens = 512 # Your model's max token limit
            cls._instance.overlap = 50 # number of tokens overlapped between consecutive chunks
        return cls._instance
    
    def get_embeddings(self, text):
        # Short texts are embedded directly; longer texts are chunked and mean-pooled
        total_tokens = self.get_tokens_count(text)
        if total_tokens <= self.max_tokens:
            return self.get_sentence_embedding(text)
        else:
            return self.get_chunked_embeddings(text).tolist()
    
    def get_tokens_count(self, text):
        all_tokens = self.tokenizer(text, padding=False, truncation=False, return_tensors=None)['input_ids']
        return len(all_tokens)

    def get_sentence_embedding(self, text):
        embedding = self.model.encode(text).tolist()
        return embedding
    
    def get_chunked_embeddings(self, text):
        # Embed each chunk separately, then mean-pool the chunk vectors into one embedding
        chunks = self.chunk_text(text)
        embeddings = []
        for chunk in chunks:
            embedding = np.array(self.get_sentence_embedding(chunk))
            embeddings.append(embedding)
        return np.mean(embeddings, axis=0)

    def chunk_text(self, text):
        # Split the token ids into overlapping windows and decode each window back to text
        tokens = self.tokenizer(text, padding=False, truncation=False, return_tensors=None)['input_ids']
        chunks = []
        for i in range(0, len(tokens), self.max_tokens - self.overlap):
            chunk = tokens[i:i + self.max_tokens]
            chunks.append(self.tokenizer.decode(chunk, skip_special_tokens=True))
            if len(chunk) < self.max_tokens:
                break
        return chunks
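
The example above uses BAAI/bge-large-en-v1.5 via FlagEmbedding, but the same class works with the model from your question; a rough, untested sketch of the swap (assuming the sentence-transformers repo name):

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

# Untested sketch: load the model from the question instead of FlagModel.
# Swap these lines into __new__ above; model.encode() also returns a
# numpy array, so the rest of the class stays the same.
model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L-6-v3')
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/msmarco-MiniLM-L-6-v3')
print(model.max_seq_length)  # confirm the real token limit before hard-coding max_tokens = 512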

Later you can use the embedding model like this -

import numpy as np

input_text = "I need the embedding of this text."
embedding_model = EmbeddingModel()
embedding = np.array(embedding_model.get_embeddings(input_text))
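
For the asymmetric search itself, you then embed the (short) query the same way and score it against your stored document vectors. A minimal sketch with cosine similarity over a plain numpy matrix (for 400k documents you may eventually want FAISS or a vector database; also check your model card for whether it was tuned for cosine or dot-product scoring):

# Minimal sketch: score a query against precomputed document embeddings.
# `doc_embeddings` is assumed to be a (num_docs, dim) numpy array built by
# calling embedding_model.get_embeddings on each of your texts.
def search(query, doc_embeddings, top_k=5):
    q = np.array(embedding_model.get_embeddings(query))
    # cosine similarity = dot product of L2-normalized vectors
    q = q / np.linalg.norm(q)
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = docs @ q
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]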