What's a **fair** way to compute similarities for Contrastive Learning?

I am currently working on an alignment problem using a contrastive learning methodology. As usual in contrastive learning, we have a scenario with a query (generally the sample we are currently working on) together with a set of positive keys and a set of negative keys.

The core idea of contrastive learning is to iteratively shape the model's latent/embedding space so that positive samples end up close to the query and negative samples end up further away. Hence we compute a similarity between the query and the positive and negative keys, respectively.
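
For context, the kind of objective I ultimately want to feed these similarities into is an InfoNCE-style loss. Here is a minimal sketch of that, assuming I already have one scalar similarity per key (the helper name and the temperature `tau` are just placeholders I made up):

```python
import torch
import torch.nn.functional as F

def info_nce(sim_pos, sim_neg, tau=0.07):
    """InfoNCE-style loss from precomputed similarity scores.

    sim_pos: (P,) similarities between the query and each positive key
    sim_neg: (N,) similarities between the query and each negative key
    """
    pos = sim_pos / tau
    neg = sim_neg / tau
    # Contrast each positive against the full set of negatives:
    # one row of (1 + N) logits per positive key.
    logits = torch.cat([pos.unsqueeze(1), neg.unsqueeze(0).expand(pos.numel(), -1)], dim=1)
    # The "correct class" (the positive) sits at index 0 of every row.
    targets = torch.zeros(pos.numel(), dtype=torch.long)
    return F.cross_entropy(logits, targets)
```

So whatever similarity I choose has to boil down to one scalar per (query, key) pair.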

My current query has dimensions (1, 1030, 4096), where 1 is the number of samples (1 because we only have one query), 1030 is the sequence length, and 4096 is the embedding dimension.
Now suppose we have 3 positive keys; their dimensions would then be (3, 1030, 4096), with 1030 again the sequence length and 4096 the embedding dimension.

In this case we want to compute the similarity between the query and each of the 3 positive keys, so each similarity operation reduces to a comparison between two matrices of dimensions (1030, 4096).

```
query embeddings:            (1, 1030, 4096)    # (samples, sequence length, embedding dims)
positive keys embeddings:    (3, 1030, 4096)
```
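
For concreteness, here is roughly how the tensors look in PyTorch (random values, just to show the shapes), together with one option I've considered: token-level cosine similarity, broadcasting the single query over the 3 positive keys.

```python
import torch
import torch.nn.functional as F

query = torch.randn(1, 1030, 4096)   # (samples, sequence length, embedding dims)
pos   = torch.randn(3, 1030, 4096)   # 3 positive keys

# Token-level cosine similarity: the query (1, ...) broadcasts against the
# 3 positive keys, giving one similarity per token per key.
tok_sim = F.cosine_similarity(query, pos, dim=-1)   # (3, 1030)

# A single score per key would then still need some aggregation over the
# 1030 tokens, e.g. the mean (just one possible choice):
per_key_sim = tok_sim.mean(dim=-1)                  # (3,)
```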

What would be a good way to compute the similarity, without getting lost in the curse of dimensionality?

I see the sequence length as the words in a phrase and the 4096 values as the representation of each word/token in the embedding space; hence, computing an MSE between the two matrices would not be very meaningful for the process…

Should I aggregate along the sequence length and then compute cosine similarity between the resulting vectors of embedding dimensions?
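
To make that last option concrete, this is roughly the aggregate-then-compare version I have in mind; mean pooling is just my assumption here, and max pooling or attention-weighted pooling would be alternatives:

```python
import torch
import torch.nn.functional as F

query = torch.randn(1, 1030, 4096)   # (samples, sequence length, embedding dims)
pos   = torch.randn(3, 1030, 4096)   # 3 positive keys

# Pool over the sequence dimension first (mean pooling here), then compare
# the resulting 4096-dim vectors with cosine similarity.
query_vec = query.mean(dim=1)        # (1, 4096)
pos_vec   = pos.mean(dim=1)          # (3, 4096)

sim = F.cosine_similarity(query_vec, pos_vec, dim=-1)   # (3,), one score per positive key
```

This collapses each (1030, 4096) matrix to a single 4096-dim vector, so the comparison no longer depends on the sequence length, but it also discards any token-level alignment between query and key. Is that an acceptable trade-off, or is there a fairer way to compute these similarities?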