Token classification for non-textual data

I’m looking for an implementation of an architecture that performs token classification, where the input is not a sequence of integer vocabulary IDs but a sequence of numeric vectors.
In other words, each token in the input is already represented by an embedding vector, so no vocabulary lookup is needed.
How can this be achieved?

Expected behavior

The input is a vector of size 768 for each token, in sequences of up to 512 tokens. The output should be one class prediction per token.
Maybe it is as simple as removing the layer
(word_embeddings): Embedding(50265, 768, padding_idx=1)?
In any case, a link to a solution would be most helpful.
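
One route that may work without removing any layer: Hugging Face `transformers` models such as `RobertaForTokenClassification` accept an `inputs_embeds` argument in place of `input_ids`, which bypasses the `word_embeddings` lookup while still adding position embeddings and applying layer normalization. The sketch below is only an illustration under assumptions: the number of labels, the batch size, the short sequence length, and the reduced layer count are hypothetical, and the model is randomly initialized from a config rather than loaded from a pretrained checkpoint.

```python
import torch
from transformers import RobertaConfig, RobertaForTokenClassification

# Hypothetical sizes for illustration; only 768 (and the 512 limit) come from the post.
NUM_LABELS = 5
BATCH, SEQ_LEN, HIDDEN = 2, 16, 768

config = RobertaConfig(
    hidden_size=HIDDEN,
    num_hidden_layers=2,          # kept small so the sketch runs quickly
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,  # leaves room for 512 tokens after RoBERTa's position offset
    num_labels=NUM_LABELS,
)
model = RobertaForTokenClassification(config)  # random weights, nothing is downloaded
model.eval()

# Each token is already a 768-dimensional vector, so no vocabulary IDs are involved.
features = torch.randn(BATCH, SEQ_LEN, HIDDEN)
attention_mask = torch.ones(BATCH, SEQ_LEN, dtype=torch.long)  # 1 = real token, 0 = padding

with torch.no_grad():
    outputs = model(inputs_embeds=features, attention_mask=attention_mask)

logits = outputs.logits              # shape: (BATCH, SEQ_LEN, NUM_LABELS)
predictions = logits.argmax(dim=-1)  # one predicted class index per token
```

For training, per-token `labels` of shape `(BATCH, SEQ_LEN)` can be passed to the same call to get a loss. If the vectors come from a very different distribution than text embeddings, starting from a pretrained checkpoint may help less than it does for text, so that choice is worth testing rather than assuming.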