Newbie Seeking Guidance on Optimal Sentence Size for Embedding Encoding 🙏

Ryulord · April 12, 2023, 7:46am

Make sure your tokenized sentences fit within the model’s context length so they don’t get truncated. Beyond that, you’ll probably just want to try different sizes and see what works; there’s no rule of thumb that I’m aware of for this problem. If you have some ground truth for what the correct result of a vector search should be, then evaluation is pretty straightforward. Otherwise you might have to do A/B tests on users or even just eyeball it.
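To make the ground-truth case concrete, here is a rough sketch of how you could compare two chunk sizes. It assumes you already ran your vector search for each setup and have the ranked passage IDs per query, plus a set of passage IDs you consider correct for each query. The function name `hit_rate_at_k`, the query and passage IDs, and the two result sets are all made up for illustration, and in practice the ground-truth IDs would need to be mapped onto each chunking separately.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose top-k results contain at least one relevant passage.

    retrieved: dict mapping query ID -> list of passage IDs, best match first
    relevant:  dict mapping query ID -> set of passage IDs judged correct
    """
    if not retrieved:
        raise ValueError("need at least one query to evaluate")
    hits = 0
    for query_id, ranked_ids in retrieved.items():
        # A query counts as a hit if any of its top-k passages is relevant.
        if set(ranked_ids[:k]) & relevant[query_id]:
            hits += 1
    return hits / len(retrieved)


# Hypothetical toy data: same queries, search results from two chunking setups.
relevant = {"q1": {"p3"}, "q2": {"p7"}}
results_small_chunks = {"q1": ["p3", "p1", "p9"], "q2": ["p2", "p7", "p4"]}
results_large_chunks = {"q1": ["p3", "p1", "p9"], "q2": ["p2", "p4", "p5"]}

score_small = hit_rate_at_k(results_small_chunks, relevant, k=2)
score_large = hit_rate_at_k(results_large_chunks, relevant, k=2)
print("small chunks:", score_small)
print("large chunks:", score_large)
```

Whichever setup scores higher on your own queries is the one to prefer. With only a handful of queries the numbers will be noisy, so treat small differences with caution.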

In case you weren’t aware already, it’s also pretty standard practice to have your passages overlap, so that the bit of text you care about for a particular search doesn’t get cut in half at a passage boundary.
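A minimal sketch of overlapping passages, assuming you split on whitespace and measure size in words. That is a simplification: the model's limit is in tokens, so you would want to count with the model's own tokenizer instead. The function name `chunk_with_overlap` and the sizes used are just for illustration.

```python
def chunk_with_overlap(words, chunk_size, overlap):
    """Split a list of words into windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(words):
            break
    return chunks


# Toy input: ten placeholder words, windows of 4 words sharing 2 with the previous one.
words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, chunk_size=4, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window repeats the last two words of the previous one, so a phrase that straddles a boundary still appears whole in at least one passage. More overlap means more passages to embed and store, so it is another knob to tune alongside the passage size.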
