Hey,

My goal is to embed a set of paragraphs. The sentence transformer model that produces the embeddings performs the tokenization internally, so it expects plain text as input. But since its maximum input size is given in tokens (348), I don't know exactly how many words I can put into the model, because I don't know how many tokens they will be converted to.

I was told to use 128 words as an approximation, but I would like to be more precise, which means feeding the model exactly 348 tokens on each model.encode() call.

I thought about loading the exact tokenizer my model uses internally via tokenizer.from_pretrained() and then iterating through each paragraph, tokenizing each word, counting the resulting tokens, and cutting the text when 348 is reached. That way, we would know exactly how much of each paragraph fits into the model.
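The approach described above can be sketched roughly as follows. This is only an illustration under a few assumptions: the function name truncate_to_token_budget, the reserved parameter, and the toy tokenizer are all made up here so the snippet runs without downloading a model. The reserved=2 default assumes a BERT-style model that adds two special tokens (such as [CLS] and [SEP]) to every input; other models may add a different number. Also note that tokenizing word by word might not give exactly the same count as tokenizing the whole text at once for every tokenizer.

```python
from typing import Callable, List


def truncate_to_token_budget(
    text: str,
    tokenize: Callable[[str], List[str]],
    max_tokens: int,
    reserved: int = 2,  # assumption: room for two special tokens
) -> str:
    """Return the longest whole-word prefix of `text` that fits the budget."""
    budget = max_tokens - reserved
    kept: List[str] = []
    used = 0
    for word in text.split():
        n = len(tokenize(word))  # tokens this word would cost
        if used + n > budget:
            break  # the next word would overflow, so cut here
        kept.append(word)
        used += n
    return " ".join(kept)


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in: split a word into chunks of up to 4 characters."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


# With a real model you would pass its own tokenizer instead, for example:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("<your-model-name>")
#   tokenize = tok.tokenize

paragraph = "embedding paragraphs with sentence transformers"
print(truncate_to_token_budget(paragraph, toy_tokenize, max_tokens=10))
# prints: embedding paragraphs with
```

Cutting at word boundaries means the result can end up a few tokens under the limit rather than exactly at it, since a word is never split in the middle.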

Would this be a viable approach or can anyone think of something more logical and efficient? Do people usually stick with approximations?

Best regards,

John