Word, sentence, or long-context embeddings?

Hi, I want to build a semantic / neural search over my documents using vector embeddings.
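
Roughly what I'm picturing, as a minimal sketch (assuming sentence-transformers; the model name and documents are just placeholders):

```python
# Embed the corpus once, embed each query, rank documents by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

docs = ["first document ...", "second document ..."]  # my corpus (placeholder)
doc_embeddings = model.encode(docs, convert_to_tensor=True)

query = "what I'm searching for"
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]
ranking = scores.argsort(descending=True)  # best-matching documents first
```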

While I know I probably shouldn’t be using word-level embeddings for this, I’m still confused about the difference between BERT and SBERT… (So basic, I know…)
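
To make my confusion concrete, here's the contrast as I currently understand it (a sketch assuming Hugging Face transformers and sentence-transformers; please correct me if the framing is wrong):

```python
# My (possibly wrong) understanding: plain BERT gives per-token vectors that
# I'd have to pool myself, and the result isn't trained for cosine comparison.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("an example sentence", return_tensors="pt")
with torch.no_grad():
    token_vectors = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
naive_sentence_vector = token_vectors.mean(dim=1)      # mean-pooled token vectors

# SBERT, as I understand it, is a BERT-style encoder fine-tuned on sentence
# pairs so the pooled vector is directly usable for semantic similarity:
from sentence_transformers import SentenceTransformer
sbert = SentenceTransformer("all-mpnet-base-v2")        # placeholder SBERT model
sentence_vector = sbert.encode("an example sentence")
```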

Similarly, ada-002 gives me an 8k-token context… If I’m comparing whole documents, that seems better, but how does a single embedding capture the overall semantics across all 8k tokens?
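
For comparison, the ada-002 route I'm weighing looks something like this (a sketch assuming the OpenAI Python client; batching and error handling omitted):

```python
# Sketch of the ada-002 route: one embedding per (long) document,
# then cosine similarity between whole-document vectors.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

doc_a = embed("... a document of up to ~8k tokens ...")
doc_b = embed("... another long document ...")

similarity = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
```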

Finally, I’d like to try training my own SBERT (or ada-002?), but, as the above probably shows, I’m confused about how to approach that too.
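
In case it helps, this is roughly the kind of fine-tuning I imagine attempting (a sketch assuming the sentence-transformers training API and that I can mine (query, relevant passage) pairs from my own documents):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

# Pairs mined from my own data (placeholders here)
train_examples = [
    InputExample(texts=["some query", "a passage that answers it"]),
    InputExample(texts=["another query", "its relevant passage"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss treats other in-batch passages as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```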

I’m working on building an evaluation set, which is obviously key to anything I try, but in the meantime I’m looking for some insight.
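
For the evaluation set, my current plan is labeled (query, relevant document) pairs scored with something like recall@k (the helper below is just hypothetical):

```python
# Hypothetical evaluation loop: labeled (query, relevant_doc_id) pairs,
# score each embedding setup by recall@k.
import numpy as np

def recall_at_k(query_vecs, doc_vecs, relevant_ids, k=10):
    # query_vecs: (num_queries, dim), doc_vecs: (num_docs, dim), both L2-normalized
    scores = query_vecs @ doc_vecs.T                    # cosine similarity
    top_k = np.argsort(-scores, axis=1)[:, :k]          # best k doc indices per query
    hits = [rel in row for rel, row in zip(relevant_ids, top_k)]
    return float(np.mean(hits))
```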

Many thanks!