Encoder-only Transformer (BERT-like) for Token Classification outside NLP


As the title suggests, is there an encoder-only Transformer base model for token classification that accepts not only discrete text tokens, but also tokens given directly as n-dimensional feature vectors? The sequences do not need to respect causality, hence the encoder-only choice.

I was thinking of something like BERT with a token classification head, but without the input embedding layers, since I would already be feeding the model fixed feature vectors.

If there is no base model quite like this, could you suggest the fastest way to accomplish it given the current state of the model zoo on the Hub? I would basically need a BERT-like model for token classification that accepts a 3D input tensor of shape (batch size, sequence length, feature dimension) rather than a 2D matrix of token IDs.
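For what it's worth, here is a sketch of what I imagine this would look like with the `transformers` library, using the `inputs_embeds` argument that HF BERT models accept to bypass the token-ID embedding lookup. I'm assuming here that `hidden_size` can simply be set to the feature dimension; note that BERT's position and token-type embeddings are still added on top of the supplied vectors:

```python
import torch
from transformers import BertConfig, BertForTokenClassification

# Assumed setup: 512-dim feature vectors, binary label per token.
# All hyperparameters below are placeholders, not recommendations.
config = BertConfig(
    hidden_size=512,        # must match the feature dimension
    num_attention_heads=8,
    intermediate_size=2048,
    num_hidden_layers=4,
    num_labels=2,
)
model = BertForTokenClassification(config)

batch, seq_len = 2, 16
features = torch.randn(batch, seq_len, 512)          # precomputed token vectors
mask = torch.ones(batch, seq_len, dtype=torch.long)  # 1 = real token, 0 = padding

# `inputs_embeds` replaces `input_ids`, so no vocabulary lookup happens.
out = model(inputs_embeds=features, attention_mask=mask)
print(out.logits.shape)  # torch.Size([2, 16, 2]) -> per-token binary logits
```

If this is roughly right, the remaining question is whether training such a randomly initialized model from scratch is any better than a plain PyTorch encoder.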

In my use case, each token is a 512-dimensional feature vector derived from sequences of game logs, and I have to predict a binary label for every token in the sequence.

Before developing my own simple encoder-only Transformer implementation, I was wondering whether a more efficient implementation already exists on the Hub.
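In case it helps clarify what I mean by "my own simple implementation", this is roughly the fallback I had in mind, a minimal sketch built on `torch.nn.TransformerEncoder` (bidirectional attention, no causal mask). The class name and hyperparameters are mine, not from any existing model:

```python
import torch
import torch.nn as nn

class TokenClassifier(nn.Module):
    """Minimal encoder-only Transformer: 512-dim token vectors -> per-token logits."""

    def __init__(self, d_model=512, nhead=8, num_layers=4, num_labels=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=2048, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_labels)

    def forward(self, x, padding_mask=None):
        # x: (batch, seq_len, d_model); no causal mask, so attention is
        # fully bidirectional. Note: positional information would have to
        # be added separately (e.g. sinusoidal encodings), it is omitted here.
        h = self.encoder(x, src_key_padding_mask=padding_mask)
        return self.head(h)

model = TokenClassifier()
x = torch.randn(2, 16, 512)  # batch of precomputed game-log features
logits = model(x)
print(logits.shape)  # torch.Size([2, 16, 2])
```

This works shape-wise, but I'd rather reuse an optimized, battle-tested implementation if one exists.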