I have many series of trajectories `[[x1, y1, z1], [x2, y2, z2], ..., [xn, yn, zn]]` of objects I've been tracking in imaging. Some of the time points `[xi, yi, zi]` are missing, and I'd like to impute these coordinates `[x_hat, y_hat, z_hat]` - a problem that seems very similar to masked language modeling!

Conceptually the transformer makes sense, but I am stuck on the most trivial step: can numerical values be used as input to a transformer like BERT?

- I don't need to "tokenize" my input, which is part of my confusion.
- Another part is that my problem has no "vocabulary", since the outputs are numerical values (normalized between 0 and 1).
- Do I have to use a special architecture (e.g., BEiT)?
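For concreteness, here is a minimal sketch of what I imagine (just my assumption, written in NumPy with made-up names like `embed_points`, `W_in`, and `d_model`): the vocabulary-embedding lookup of BERT is replaced by a linear projection of the continuous coordinates, missing time points get a learned mask vector, and a regression head maps encoder outputs back to 3-D coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical embedding width

# "Learned" parameters, randomly initialized here just for illustration
W_in = rng.normal(size=(3, d_model))    # projects [x, y, z] -> d_model
mask_vec = rng.normal(size=(d_model,))  # learned [MASK] embedding
W_out = rng.normal(size=(d_model, 3))   # regression head back to 3-D

def embed_points(traj, missing):
    """Replace BERT's token-embedding lookup: linearly project the
    continuous coordinates, and substitute a learned mask vector
    wherever a time point is missing."""
    h = traj @ W_in       # (n, d_model)
    h[missing] = mask_vec # mask out the missing time points
    return h

# Toy trajectory of n = 5 points normalized to [0, 1]; point 2 is missing
traj = rng.uniform(size=(5, 3))
missing = np.array([False, False, True, False, False])

h = embed_points(traj, missing)  # (5, d_model); would feed the encoder
# ... transformer encoder layers would go here; the head then regresses:
pred = h @ W_out                 # (5, 3) predicted [x_hat, y_hat, z_hat]
```

Is this the right idea, i.e. train with an MSE loss on the masked positions only, instead of cross-entropy over a vocabulary?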