Generate embeddings with custom (non-text) dataset

Long story short, I have a dataset of items represented as variable-length integer sequences (e.g. [34, 827, 4011, 5, ...]), analogous to text after tokenization and numericalization. I want to use masked language modeling to learn self-supervised embeddings of these integer sequences. What would be the best way to do this?
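For concreteness, here's a toy version of what the data looks like and how I'm currently batching it (the `PAD_ID` and the example sequences are made up, not from the real dataset):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assuming id 0 is reserved for padding

sequences = [
    [34, 827, 4011, 5],
    [12, 9, 77, 301, 8, 4500],
    [6021, 3],
]

def collate(batch):
    # pad variable-length sequences to the longest one in the batch
    tensors = [torch.tensor(seq, dtype=torch.long) for seq in batch]
    input_ids = pad_sequence(tensors, batch_first=True, padding_value=PAD_ID)
    attention_mask = (input_ids != PAD_ID).long()
    return input_ids, attention_mask

input_ids, attention_mask = collate(sequences)
```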

I can see how to rig up a BERT-style model that will randomly mask tokens and predict them, but I'm stuck on the best way to generate the final embedding (the thing I actually want). Sure, I can chuck a [CLS] token into each sequence, but I don't know how useful that would be without the next-sentence-prediction task used by BERT. Maybe average the token representations from one of the layers during inference? Does anyone have ideas on this?
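To make the question concrete, these are the two pooling strategies I'm weighing at inference time; the `encoder` call and its output shape are placeholders, not any particular library's API:

```python
import torch

def cls_embedding(last_hidden_state):
    # assumes a [CLS]-style token was prepended at position 0 of every sequence
    return last_hidden_state[:, 0, :]

def mean_pooled_embedding(last_hidden_state, attention_mask):
    # average token representations, ignoring padding positions
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# last_hidden_state = encoder(input_ids, attention_mask)  # (batch, seq_len, hidden)
# embedding = mean_pooled_embedding(last_hidden_state, attention_mask)
```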

Extra info:
The data doesn’t have a “paired data” framework, so pair-based SSL isn’t possible.
There isn’t a clear concept of “data augmentation” for this data, so augmentation-based SSL methods like BYOL or SimSiam are out.
Masked token prediction makes a lot of sense for this task, because the unmasked tokens carry contextual information that should enable correct prediction of the masked tokens (rough sketch of what I have in mind below).
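For what it's worth, this is roughly the masking/loss step I'm picturing: a minimal sketch assuming `MASK_ID` is a reserved id in the vocabulary and the model outputs per-token logits over the item vocabulary (names here are placeholders):

```python
import torch
import torch.nn.functional as F

MASK_PROB = 0.15

def mask_tokens(input_ids, attention_mask, mask_id):
    labels = input_ids.clone()
    # choose ~15% of real (non-pad) positions to mask
    probs = torch.full(input_ids.shape, MASK_PROB)
    masked = torch.bernoulli(probs).bool() & attention_mask.bool()
    labels[~masked] = -100              # only score the masked positions
    masked_ids = input_ids.clone()
    masked_ids[masked] = mask_id        # (BERT also keeps/randomizes some; omitted here)
    return masked_ids, labels

# logits = model(masked_ids, attention_mask)   # (batch, seq_len, vocab)
# loss = F.cross_entropy(
#     logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
# )
```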