Newbie Seeking Guidance on Optimal Sentence Size for Embedding Encoding šŸ™

Hey everyone! :blush:

I hope you’re all doing great. I’m new to the community and need your expertise to help me navigate through a challenge I’m facing. I’m working on generating embeddings for some content, with the aim of populating a vector store to perform semantic search. I’ve already found a suitable model for my purpose and language, but I’m a bit uncertain about a specific aspect.

My main concern is determining the optimal or maximum sentence size for encoding each embedding. I understand that sentence length plays a crucial role, as having sentences that are too long might result in the loss of some ā€œmeaningā€ in the embedding space. On the other hand, if sentences are too short, they may not carry enough ā€œmeaningā€ to be useful.

I’m curious to know how to make the right decision. Is there a way to determine the optimal sentence size by examining the model’s training hyperparameters or tokenizer properties?

I would be immensely grateful for any insights or advice you can share on this topic. I know I have a lot to learn, and I’m eager to hear from all of you experienced folks here in the community. Thank you so much in advance for taking the time to help out a newbie like me! :raised_hands:

Warm regards,

Andrea

You’ll want to make sure your tokenized sentences fit within the context length so they don’t get truncated, but aside from that, you’ll probably just want to try different things and see what works. There’s no rule of thumb that I’m aware of for this problem. If you have some ground truth for what the correct result of a vector search should be, then evaluation is pretty straightforward; otherwise, you might have to do A/B tests on users or even just eyeball it.
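For a concrete starting point, here’s a rough recall@k sketch. It’s just an illustration, assuming you have a small hand-labelled set of queries mapped to the passage IDs that should be retrieved (the ground_truth and passages dicts below are made-up placeholders):

import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical evaluation data: each query maps to the IDs of passages
# a human judged relevant for it; passages maps IDs to their text.
ground_truth = {
    "how do I reset my password": {"doc_12", "doc_47"},
    "refund policy for damaged items": {"doc_3"},
}
passages = {"doc_12": "To reset your password...", "doc_47": "...", "doc_3": "..."}

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

ids = list(passages)
passage_emb = model.encode([passages[i] for i in ids], normalize_embeddings=True)

def recall_at_k(k=5):
    hits = 0
    for query, relevant in ground_truth.items():
        query_emb = model.encode(query, normalize_embeddings=True)
        scores = passage_emb @ query_emb  # cosine similarity (embeddings are normalized)
        top_k = {ids[j] for j in np.argsort(-scores)[:k]}
        hits += bool(top_k & relevant)  # did any relevant passage make the top k?
    return hits / len(ground_truth)

print(f"recall@5: {recall_at_k(5):.2f}")

Re-run that for each chunking strategy you want to compare and pick whichever scores best.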

In case you weren’t aware already, it’s also pretty standard practice to have your passages overlap, so that the bit of text you care about for a particular search doesn’t get cut in half.
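A minimal sketch of that, splitting on token counts rather than characters (chunk_size and overlap are just placeholder values to tune for your model and data):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text, chunk_size=128, overlap=32):
    # Split text into overlapping windows measured in tokens, not characters,
    # so each chunk (plus the special tokens added at encoding time) stays
    # comfortably under the model's context length.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(token_ids):
            break
    return chunks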

Thanks for your reply @Ryulord!
What do you mean exactly by ā€œcontext lengthā€? I assume you mean the maximum token count for generating the embedding. How do I find that number for a given sentence_transformer, though?

Regarding the overlap: yep, done that! I figured it would avoid ā€œtruncatingā€ sentences mid-way and losing actual context.

Yup. Transformers calls it max_position_embeddings, and it’s stored in the model’s config, so you can use this code to check:

from transformers import AutoConfig

checkpoint = "sentence-transformers/all-MiniLM-L6-v2"
config = AutoConfig.from_pretrained(checkpoint)
print(f"Maximum context length for this model: {config.max_position_embeddings}")