Newbie Seeking Guidance on Optimal Sentence Size for Embedding Encoding 🙏

Hey everyone! :blush:

I hope you're all doing great. I'm new to the community and need your expertise to help me navigate through a challenge I'm facing. I'm working on generating embeddings for some content, with the aim of populating a vector store to perform semantic search. I've already found a suitable model for my purpose and language, but I'm a bit uncertain about a specific aspect.

My main concern is determining the optimal or maximum sentence size for encoding each embedding. I understand that sentence length plays a crucial role, as sentences that are too long might lose some "meaning" in the embedding space. On the other hand, if sentences are too short, they may not carry enough "meaning" to be useful.

I'm curious to know how to make the right decision. Is there a way to determine the optimal sentence size by examining the model's training hyperparameters or tokenizer properties?

I would be immensely grateful for any insights or advice you can share on this topic. I know I have a lot to learn, and I'm eager to hear from all of you experienced folks here in the community. Thank you so much in advance for taking the time to help out a newbie like me! :raised_hands:

Warm regards,

Andrea

You'll want to make sure your tokenized sentences fit in the context length so they don't get truncated, but aside from that you'll probably just want to try different things and see what works. There's no rule of thumb that I'm aware of for this problem. If you have some ground truth for what the correct result of a vector search should be, then evaluation is pretty straightforward; otherwise you might have to do A/B tests on users or even just eyeball it.
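Something like this will tell you whether a given sentence would get truncated by the tokenizer (the checkpoint here is just a placeholder, swap in whatever model you're actually using):

from transformers import AutoTokenizer

# Placeholder checkpoint; use your own sentence-transformers model here.
checkpoint = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sentences = [
    "A short sentence.",
    "A much longer passage that might exceed the model's context length...",
]

for sentence in sentences:
    # Count the tokens the model would actually see.
    n_tokens = len(tokenizer(sentence)["input_ids"])
    if n_tokens > tokenizer.model_max_length:
        print(f"Would be truncated: {n_tokens} > {tokenizer.model_max_length} tokens")
    else:
        print(f"Fits: {n_tokens} tokens")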

It's also pretty standard practice to have your passages overlap so that the bit of text you care about for a particular search doesn't get cut in half, just in case you weren't aware already.
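If it helps, the overlap part usually looks something like this token-level sliding window (the chunk size and overlap values here are just placeholders, not a recommendation):

from transformers import AutoTokenizer

checkpoint = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def chunk_with_overlap(text, chunk_size=256, overlap=32):
    # Tokenize once, then slide a window over the token ids with some overlap
    # so a relevant passage isn't cut in half at a chunk boundary.
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        window = token_ids[start : start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(token_ids):
            break
    return chunks

chunks = chunk_with_overlap("Your long document goes here...")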

Thanks for your reply @Ryulord!
What do you mean exactly by "context length"? I assume you mean the maximum token count for generating the embedding. How do I find that number for a given sentence_transformer, though?

Regarding the overlap: yep, done that! I figured that could avoid "truncating" sentences mid-way, leading to a loss of actual context.

Yup. Transformers calls it max_position_embeddings, and it can be found in the model's config, so you can use this code to check:

from transformers import AutoConfig

# Model checkpoint whose config we want to inspect.
checkpoint = "sentence-transformers/all-MiniLM-L6-v2"

# Load only the config (no weights needed) and read off the context length.
config = AutoConfig.from_pretrained(checkpoint)
print(f"Maximum context length for this model: {config.max_position_embeddings}")