My goal is to embed a set of paragraphs. The sentence transformer model that produces the embeddings performs the tokenization internally, so it expects plain text as input. But since its max input size is given in tokens (348), I don't know exactly how many words I can feed into the model, because I don't know how many tokens they will be converted into.
I was told to use 128 words as an approximation, but I would like to be more precise, i.e. fill the model with exactly 348 tokens on each model.encode() call.
I thought about loading the exact tokenizer my model uses internally via `AutoTokenizer.from_pretrained()`, then iterating through each paragraph word by word, counting the resulting tokens, and cutting the text once 348 is reached. That way we know exactly how much of each paragraph fits into the model.
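For reference, here is a minimal sketch of what I mean. The `count_tokens` callable is an assumption standing in for the model's real tokenizer (with Hugging Face it could be something like `lambda t: len(tokenizer.encode(t, add_special_tokens=True))`); I re-count the whole prefix on every step because per-word token counts don't always sum to the full-string count (subword merges, special tokens):

```python
from typing import Callable, List

def truncate_to_token_budget(
    text: str,
    count_tokens: Callable[[str], int],
    budget: int,
) -> str:
    """Greedily keep whole words while the tokenized prefix stays within budget.

    count_tokens is a stand-in for the model's tokenizer, e.g. with
    Hugging Face transformers (hypothetical model name):
        tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        count_tokens = lambda t: len(tokenizer.encode(t, add_special_tokens=True))
    """
    kept: List[str] = []
    for word in text.split():
        candidate = " ".join(kept + [word])
        # Re-tokenize the full prefix: O(n^2) overall, but correct, since
        # token counts of individual words need not add up exactly.
        if count_tokens(candidate) > budget:
            break
        kept.append(word)
    return " ".join(kept)
```

This is obviously not the most efficient implementation (each word triggers a fresh tokenization of the prefix), but it illustrates the exact-counting idea I have in mind.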
Would this be a viable approach, or can anyone think of something more logical and efficient? Do people usually stick with approximations?