I hope you're all doing great. I'm new to the community and need your expertise to help me navigate a challenge I'm facing. I'm working on generating embeddings for some content, with the aim of populating a vector store to perform semantic search. I've already found a suitable model for my purpose and language, but I'm a bit uncertain about a specific aspect.
My main concern is determining the optimal or maximum sentence size for encoding each embedding. I understand that sentence length plays a crucial role, as sentences that are too long might lose some "meaning" in the embedding space. On the other hand, sentences that are too short may not carry enough "meaning" to be useful.
I'm curious to know how to make the right decision. Is there a way to determine the optimal sentence size by examining the model's training hyperparameters or tokenizer properties?
I would be immensely grateful for any insights or advice you can share on this topic. I know I have a lot to learn, and I'm eager to hear from all of you experienced folks here in the community. Thank you so much in advance for taking the time to help out a newbie like me!
You'll want to make sure your tokenized sentences fit within the context length so they don't get truncated, but beyond that you'll probably just have to try different things and see what works. There's no rule of thumb that I'm aware of for this problem. If you have some ground truth for what the correct result of a vector search should be, then evaluation is pretty straightforward; otherwise you might have to do A/B tests on users or even just eyeball it.
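To illustrate the "straightforward" part, here's a minimal sketch of what such an eval could look like, assuming you have query-to-passage ground truth (the model name and data below are just placeholders):

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model and toy data; swap in your own model, passages, and ground truth.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

queries = ["how do I reset my password?"]
passages = [
    "To reset your password, open Settings and choose 'Reset password'.",
    "Shipping usually takes 3-5 business days.",
]
# Ground truth: query index -> index of the passage that should be retrieved first.
relevant = {0: 0}

q_emb = model.encode(queries, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)
scores = util.cos_sim(q_emb, p_emb)  # one row of similarity scores per query

hits = sum(int(scores[qi].argmax()) == pi for qi, pi in relevant.items())
print(f"recall@1: {hits / len(relevant):.2f}")
```

Scaling that up to recall@k or MRR over a real query set is mostly a matter of swapping in your own data; you can then compare different chunk sizes against the same ground truth.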
It's also pretty standard practice to have your passages overlap so that the bit of text you care about for a particular search doesn't get cut in half, in case you weren't already aware.
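In case it helps, here's a rough sketch of what I mean by overlapping chunks, sliding a token window with a Hugging Face tokenizer (the token counts are arbitrary placeholders; tune them for your model and content):

```python
from transformers import AutoTokenizer

def chunk_with_overlap(text, tokenizer, max_tokens=256, overlap=32):
    """Slide a fixed-size token window over the text so neighbouring chunks share `overlap` tokens."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(token_ids):
            break
    return chunks

# Placeholder tokenizer; use the one that belongs to your embedding model.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
chunks = chunk_with_overlap("your long document text goes here ...", tokenizer)
```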
Thanks for your reply @Ryulord!
What do you mean exactly by "context length"? I assume you mean the maximum token count for generating the embedding. How do I find that number for a given sentence_transformer, though?
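Is it something like this? (Just guessing at the attribute names here, so please correct me if I'm off.)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model name

print(model.max_seq_length)              # is this the token limit before truncation?
print(model.tokenizer.model_max_length)  # or should I look at the tokenizer instead?
```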
Regarding the overlap: yep, done that! Figured it would avoid "truncating" sentences mid-way and losing actual context.