Text input longer than the max token length for semantic search embeddings

Hi, this is my first time trying to build semantic search. In my case it's asymmetric semantic search, and I am using msmarco-MiniLM-L-6-v3 to create embeddings. I have about 400k elements to embed.

The problem: my input text is way longer than the 512-token limit. The texts are on average 3000 words (so even more tokens). From what I understand, these could be solutions:

  • Chunking the text (with overlap?), embedding each chunk, and then mean pooling the resulting vectors? The problem is I have no idea how to do this. I am a beginner and this is my first time doing this, so is there any code example/tutorial/steps to follow? (I've pasted a rough sketch of what I think this looks like after this list.)

  • Running a summarizer model like facebook/bart-large-cnn to condense the text down to 512 tokens or fewer, then embedding the summary (also sketched below). But I feel this would cost more in computation/time and would lose context and granularity(?)
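
Here is my rough attempt at option 1, assuming sentence-transformers. The 200-word chunk size and 50-word overlap are just guesses I would have to tune so each chunk stays under 512 tokens. Is this roughly the right idea?

```python
# Rough sketch of option 1: chunk the text -> embed each chunk -> mean pool.
# Chunk size / overlap are in words and are guesses I'd need to tune
# (512 tokens is roughly 300-400 English words).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L-6-v3")

def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks if chunks else [text]

def embed_long_text(text):
    """Embed each chunk, then mean-pool the chunk vectors into one document vector."""
    chunks = chunk_words(text)
    chunk_embeddings = model.encode(chunks)          # shape: (n_chunks, 384)
    doc_embedding = chunk_embeddings.mean(axis=0)    # mean pooling over chunks
    # normalize so cosine similarity is comparable across documents
    return doc_embedding / np.linalg.norm(doc_embedding)

doc_vector = embed_long_text("some very long document text ...")
```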
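
And this is what I think option 2 would look like. One thing I noticed is that facebook/bart-large-cnn itself only takes 1024 input tokens, so my 3000-word texts would get truncated before summarizing anyway. The max_length/min_length values here are arbitrary:

```python
# Rough sketch of option 2: summarize first, then embed the summary.
from sentence_transformers import SentenceTransformer
from transformers import pipeline

embedder = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L-6-v3")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_then_embed(text):
    # BART is limited to 1024 input tokens, so very long texts get truncated here
    summary = summarizer(text, max_length=256, min_length=64, truncation=True)[0]["summary_text"]
    return embedder.encode(summary, normalize_embeddings=True)
```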

Any tips are appreciated. This is really confusing to me, especially because the only way to know you did it wrong is to test the results.