Combining vectors when using contextual word embeddings with large datasets

I’m interested in using contextual word embeddings generated by a transformer-based model to explore the similarity of certain words in a large dataset.

As my dataset is far larger than the maximum sequence length allowed by most transformer models, presumably I would need to break it down into individual sentences & feed those into the model one at a time. That would give me a list of contextual word embeddings per sentence. What I’m struggling to understand is how I could then best translate this list into a meaningful embedding per word across the whole dataset.
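For concreteness, here is a minimal sketch of what I mean by the per-sentence step, assuming a Hugging Face BERT model (`bert-base-uncased`) — the sentences and the choice of `last_hidden_state` as the embedding layer are my own assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# two toy sentences using "bank" in different senses
sentences = [
    "I deposited the cheque at the bank.",
    "We had a picnic on the river bank.",
]
enc = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**enc).last_hidden_state  # shape: (batch, seq_len, 768)

# collect one contextual vector per occurrence of "bank"
bank_vectors = []
for i in range(len(sentences)):
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][i])
    for pos, tok in enumerate(tokens):
        if tok == "bank":
            bank_vectors.append(hidden[i, pos])

# the two occurrences get different vectors because their contexts differ
similarity = torch.nn.functional.cosine_similarity(
    bank_vectors[0], bank_vectors[1], dim=0
)
```

Repeating this over every sentence in the dataset is what produces the per-occurrence list I’d then need to aggregate somehow.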

The immediately obvious approach would be to average the embeddings for each word. However, the whole point of contextual embeddings is that they capture different uses/meanings of the same word. ‘Bank’ as a financial institution and ‘bank’ as the side of a river may have very different embeddings, so I’m not sure the average would carry much meaning. Is this a genuine concern? Is there a better approach?
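One alternative I’ve considered is clustering the occurrence vectors per word first and averaging within each cluster, so each sense gets its own vector. A toy sketch with synthetic vectors (the 4-d vectors and the choice of k-means with k=2 are purely illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# hypothetical: 10 occurrences of "bank" near a "finance" direction
# and 10 near a "river" direction, in a toy 4-d embedding space
finance = rng.normal(loc=[1.0, 0.0, 0.0, 0.0], scale=0.05, size=(10, 4))
river = rng.normal(loc=[0.0, 1.0, 0.0, 0.0], scale=0.05, size=(10, 4))
occurrences = np.vstack([finance, river])

# naive approach: one averaged vector for the word,
# which lands between the two senses and resembles neither
avg = occurrences.mean(axis=0)

# sense-aware approach: cluster the occurrences, then
# average within each cluster to get one vector per sense
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(occurrences)
sense_vectors = np.array(
    [occurrences[km.labels_ == k].mean(axis=0) for k in range(2)]
)
```

The catch, of course, is that the number of senses per word isn’t known in advance, so k would need to be chosen per word somehow.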

In such a use case, is there any value in using a contextual transformer model over a static one?