Newbie Seeking Guidance on Optimal Sentence Size for Embedding Encoding šŸ™

Hey everyone! :blush:

I hope you’re all doing great. I’m new to the community and need your expertise to help me navigate through a challenge I’m facing. I’m working on generating embeddings for some content, with the aim of populating a vector store to perform semantic search. I’ve already found a suitable model for my purpose and language, but I’m a bit uncertain about a specific aspect.

My main concern is determining the optimal or maximum sentence size for encoding each embedding. I understand that sentence length plays a crucial role, as having sentences that are too long might result in the loss of some ā€œmeaningā€ in the embedding space. On the other hand, if sentences are too short, they may not carry enough ā€œmeaningā€ to be useful.

I’m curious to know how to make the right decision. Is there a way to determine the optimal sentence size by examining the model’s training hyperparameters or tokenizer properties?

I would be immensely grateful for any insights or advice you can share on this topic. I know I have a lot to learn, and I’m eager to hear from all of you experienced folks here in the community. Thank you so much in advance for taking the time to help out a newbie like me! :raised_hands:

Warm regards,

Andrea

You’ll want to make sure your tokenized sentences fit within the context length so they don’t get truncated, but aside from that, you’ll probably just want to try different things and see what works. There’s no rule of thumb that I’m aware of for this problem. If you have some ground truth for what the correct result of a vector search should be, then evaluation is pretty straightforward; otherwise, you might have to do A/B tests on users or even just eyeball it.
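For a concrete starting point, here’s a rough recall@k sketch. It’s just an illustration, assuming you have a small hand-labelled set of queries mapped to the passage IDs that should be retrieved (the ground_truth and passages dicts below are made-up placeholders):

import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical evaluation data: each query maps to the IDs of passages
# a human judged relevant for it; passages maps IDs to their text.
ground_truth = {
    "how do I reset my password": {"doc_12", "doc_47"},
    "refund policy for damaged items": {"doc_3"},
}
passages = {"doc_12": "To reset your password...", "doc_47": "...", "doc_3": "..."}

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

ids = list(passages)
passage_emb = model.encode([passages[i] for i in ids], normalize_embeddings=True)

def recall_at_k(k=5):
    hits = 0
    for query, relevant in ground_truth.items():
        query_emb = model.encode(query, normalize_embeddings=True)
        scores = passage_emb @ query_emb  # cosine similarity (embeddings are normalized)
        top_k = {ids[j] for j in np.argsort(-scores)[:k]}
        hits += bool(top_k & relevant)  # did any relevant passage make the top k?
    return hits / len(ground_truth)

print(f"recall@5: {recall_at_k(5):.2f}")

Re-run that for each chunking strategy you want to compare and pick whichever scores best.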

In case you weren’t aware already, it’s also pretty standard practice to have your passages overlap, so that the bit of text you care about for a particular search doesn’t get cut in half.
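A minimal sketch of that, splitting on token counts rather than characters (chunk_size and overlap are just placeholder values to tune for your model and data):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text, chunk_size=128, overlap=32):
    # Split text into overlapping windows measured in tokens, not characters,
    # so each chunk (plus the special tokens added at encoding time) stays
    # comfortably under the model's context length.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(token_ids):
            break
    return chunks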

Thanks for your reply @Ryulord!
What do you mean exactly by ā€œcontext lengthā€? I assume you mean the maximum token count for generating the embedding. How do I find that number for a given sentence_transformer, though?

Regarding the overlap: yep, done that! I figured it would avoid ā€œtruncatingā€ sentences mid-way and losing actual context.

Yup. Transformers calls it max_position_embeddings, and it’s stored in the model’s config, so you can use this code to check:

from transformers import AutoConfig

checkpoint = "sentence-transformers/all-MiniLM-L6-v2"
config = AutoConfig.from_pretrained(checkpoint)
print(f"Maximum context length for this model: {config.max_position_embeddings}")