Image Features as Model Input

Hello all. I apologize for the basic question, but for some reason I am having difficulty using image features as input to a Hugging Face model.

My data comes in the form of an image-feature NumPy array extracted by a 2D CNN, but all the models seem to be built for text-based input.

If anybody could point me in the direction of an example or a code snippet, I would greatly appreciate it!


The Transformers models are designed for natural language processing, i.e. text. What makes you think they would be good for image features?

I expect you could bypass the tokenizer and input numbers directly, but I'm not sure it would do anything useful. If you did want to do that, you would need to ensure that your numbers were in the right format. For example, a BERT model expects a vector of 768 real numbers for each word-token, or rather an N x 768 matrix, where N is the number of word-tokens in the text.
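To make the shape requirement concrete: Transformers models such as `BertModel` accept an `inputs_embeds` argument that skips the token-embedding lookup, so you can feed pre-computed vectors directly. Here is a minimal sketch using a randomly initialised BERT (no pretrained weights downloaded); the 7x7 feature-map shape is just an assumed example, and projecting your CNN features down to 768 dimensions is left to you:

```python
import torch
from transformers import BertConfig, BertModel

# Randomly initialised BERT, just to demonstrate the expected input shape.
config = BertConfig()  # hidden_size defaults to 768
model = BertModel(config)
model.eval()

# Pretend these are image features: e.g. a 7x7 CNN feature map flattened
# into 49 "tokens", each already projected to BERT's hidden size of 768.
batch_size, seq_len = 1, 49
features = torch.randn(batch_size, seq_len, config.hidden_size)

with torch.no_grad():
    # inputs_embeds bypasses the tokenizer and embedding lookup entirely.
    outputs = model(inputs_embeds=features)

print(outputs.last_hidden_state.shape)  # one 768-d vector per "token"
```

Whether the model does anything useful with such input is another matter: a pretrained BERT's weights were learned from word embeddings, so arbitrary image features would land in a space the model has never seen.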

What size is your image feature array?

The main trick of transformers is the attention mechanism. It is certainly possible to use attention in image-recognition models without using transformers. See this article for an example.
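To show what that means in isolation, here is a toy sketch of scaled dot-product self-attention applied directly to CNN feature vectors, with no transformer model involved. The feature-map size (7x7 with 512 channels) is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

# Assume a 7x7 CNN feature map with 512 channels, flattened into
# 49 feature vectors of dimension 512.
feats = torch.randn(1, 49, 512)

# Self-attention: queries, keys, and values all come from the features.
q = k = v = feats
scores = q @ k.transpose(-2, -1) / (512 ** 0.5)  # (1, 49, 49) similarities
weights = F.softmax(scores, dim=-1)              # each row sums to 1
attended = weights @ v                           # (1, 49, 512) re-weighted features

print(attended.shape)
```

Each output vector is a weighted mixture of all 49 input features, so every spatial location can attend to every other one; a real model would add learned projection matrices for q, k, and v.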

If you haven’t seen them already, you might find these articles useful.