The transformers models are designed for Natural Language Processing, i.e. text. What makes you think they would work well for image features?
I expect you could bypass the tokenizer and feed numbers in directly, but I'm not sure it would do anything useful. If you do want to try that, you would need to make sure your numbers are in the right format. For example, a BERT model expects a vector of 768 real numbers for each word-token, or rather a matrix of N x 768 real numbers, where N is the number of word-tokens in the text.
What size is your image feature array?
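To make that format concrete, here is a rough sketch of what bypassing the tokenizer might look like, using the inputs_embeds argument of BertModel (a randomly initialized small config is used here just to illustrate shapes, not a pretrained model — treat this as an assumption about your setup, not a recommendation):

```python
import torch
from transformers import BertConfig, BertModel

# Small, randomly initialized BERT just to demonstrate the inputs_embeds path.
config = BertConfig(hidden_size=768, num_hidden_layers=2, num_attention_heads=12)
model = BertModel(config)
model.eval()

# Pretend these are image features: N "tokens", each a 768-dim real vector.
n_tokens = 10
features = torch.randn(1, n_tokens, config.hidden_size)  # (batch, N, 768)

with torch.no_grad():
    # inputs_embeds skips the tokenizer/embedding lookup entirely.
    outputs = model(inputs_embeds=features)

print(outputs.last_hidden_state.shape)  # same (batch, N, 768) shape back
```

Whether the output means anything is another question — the pretrained weights were fit to word embeddings, not image features, so you would almost certainly need to fine-tune.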
The main trick of transformers is the Attention mechanism. It is certainly possible to use Attention in image-recognition models without using transformers; see this article for an example: https://arxiv.org/abs/2004.13621
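For reference, the core of that trick is small: scaled dot-product attention mixes each row of the values according to query/key similarity. A minimal NumPy sketch (illustrative only, single head, no learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (N, d) arrays. Each output row is a weighted average of v's rows.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # (N, N) pairwise similarity
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v                  # (N, d)

q = k = v = np.random.randn(5, 8)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (5, 8)
```

Nothing in this computation cares whether the N input vectors came from words or from image patches, which is why Attention transfers to vision at all.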