I am hoping to train a transformer to predict a sequence of vector embeddings.
I have a target sequence and was thinking of using the average cosine distance or the MSE over all vectors in the sequence as the loss. I haven't seen these distance metrics used this way before, so I thought I would post here in case anyone has recommendations for better loss functions.
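To make the two candidates concrete, here is a minimal sketch of what I have in mind. It assumes PyTorch and tensors of shape (batch, seq_len, dim); the function name `sequence_embedding_loss` and the optional padding `mask` are only illustrative and do not come from any library.

```python
import torch
import torch.nn.functional as F


def sequence_embedding_loss(pred, target, mask=None, kind="cosine"):
    """Average a per-position distance over a sequence of embeddings.

    pred, target: float tensors of shape (batch, seq_len, dim).
    mask: optional bool tensor of shape (batch, seq_len), True for real
          (non-padding) positions. Hypothetical argument for illustration.
    """
    if kind == "cosine":
        # Cosine distance = 1 - cosine similarity, one value per position.
        per_pos = 1.0 - F.cosine_similarity(pred, target, dim=-1)
    elif kind == "mse":
        # Squared error averaged over the embedding dimension, per position.
        per_pos = ((pred - target) ** 2).mean(dim=-1)
    else:
        raise ValueError(f"unknown kind: {kind}")

    if mask is None:
        return per_pos.mean()

    mask = mask.to(per_pos.dtype)
    # Average only over the unmasked positions.
    return (per_pos * mask).sum() / mask.sum().clamp(min=1.0)


# Toy usage with random tensors standing in for model output and targets.
pred = torch.randn(2, 5, 16, requires_grad=True)
target = torch.randn(2, 5, 16)
loss = sequence_embedding_loss(pred, target, kind="cosine")
loss.backward()
```

One difference between the two options: cosine distance ignores the norm of each vector and only compares directions, while MSE also penalizes differences in magnitude.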
I am not sure whether I should be doing something more thought out. For example, averaging the cosine distance would mean the loss function does not account for the order of the embeddings. Then again, the target sequence is a sequence of Transformer embeddings that have already gone through positional encoding at some point, so I also wonder whether accounting for order is necessary.
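As a quick check of the ordering concern, here is a small sketch, again assuming PyTorch and a toy random sequence in place of real embeddings (`avg_cosine_distance` is a name I made up). It compares three cases: prediction and target aligned, only the prediction reordered, and both reordered in the same way.

```python
import torch
import torch.nn.functional as F


def avg_cosine_distance(pred, target):
    # Mean over positions of (1 - cosine similarity) between aligned vectors.
    return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()


torch.manual_seed(0)
target = torch.randn(6, 8)   # toy sequence: 6 positions, 8-dim embeddings
pred = target.clone()        # stand-in for a perfect prediction

perm = torch.tensor([1, 0, 2, 3, 4, 5])  # swap the first two positions

aligned = avg_cosine_distance(pred, target)
pred_shuffled = avg_cosine_distance(pred[perm], target)
both_shuffled = avg_cosine_distance(pred[perm], target[perm])

print(f"aligned:            {aligned.item():.4f}")
print(f"prediction swapped: {pred_shuffled.item():.4f}")
print(f"both swapped:       {both_shuffled.item():.4f}")
```

Because each predicted vector is compared with the target vector at the same position, reordering only the prediction changes the average. The average stays the same only when prediction and target are reordered together, which is the sense in which the loss has no term that couples neighbouring positions.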