Can BERT take numerical values as input for masked time-series modeling?

I have many series of trajectories [[x1, y1, z1], [x2, y2, z2], ... [xn, yn, zn]] of objects I’ve been tracking in imaging. Some of the time points [xi, yi, zi] are missing, and I’d like to impute these coordinates [x_hat, y_hat, z_hat] - a problem that looks very similar to masked language modeling!

Conceptually the transformer makes sense, but I am stuck on the most trivial step: can numerical values be used as input to a transformer like BERT?

  • I don’t need to “tokenize” my input, and that is part of my confusion.
  • Another source of confusion is that my problem has no “vocabulary”, since the outputs are numerical values (normalized between 0 and 1).
  • Do I have to use a special architecture (e.g. BEiT)?
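For what it’s worth, here is a sketch of what I imagine the setup could look like, assuming (and this is the part I’m unsure about) that BERT’s token-embedding lookup can simply be replaced by a linear projection of the 3-D coordinates, that a learned vector can play the role of the [MASK] token, and that the vocabulary softmax head can be swapped for a regression head. All class and variable names here are my own invention, not from any library:

```python
import torch
import torch.nn as nn

class TrajectoryImputer(nn.Module):
    """Hypothetical BERT-style encoder for masked trajectory imputation."""

    def __init__(self, d_model=64, nhead=4, num_layers=2, max_len=512):
        super().__init__()
        # Replaces the token-embedding table: project [x, y, z] -> d_model.
        self.input_proj = nn.Linear(3, d_model)
        # Learned embedding that stands in for the [MASK] token.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # Learned positional embeddings, as in BERT.
        self.pos_emb = nn.Embedding(max_len, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Regression head instead of a vocabulary softmax.
        self.head = nn.Linear(d_model, 3)

    def forward(self, coords, mask):
        # coords: (batch, seq_len, 3); mask: (batch, seq_len) bool, True = missing.
        h = self.input_proj(coords)
        # Overwrite missing time points with the learned mask embedding.
        h = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(h), h)
        pos = torch.arange(coords.size(1), device=coords.device)
        h = h + self.pos_emb(pos)
        h = self.encoder(h)
        return self.head(h)  # predicted [x_hat, y_hat, z_hat] at every position

model = TrajectoryImputer()
coords = torch.rand(2, 10, 3)                # toy normalized trajectories
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, 3] = True                            # pretend time point 3 is missing
pred = model(coords, mask)                   # (2, 10, 3)
# Train with MSE on the masked positions only, mirroring the MLM objective.
loss = nn.functional.mse_loss(pred[mask], coords[mask])
```

Does this kind of linear-projection input (instead of tokenization) actually work, or is there a reason BERT-like models need a discrete vocabulary?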