Hi @eddieparker,
In the pre-transformer era, it was common to build customized models that jointly learn embeddings of these extra features and concatenate them with the input text representation. A representative paper is the following:
Kikuchi, Yuta, et al. "Controlling Output Length in Neural Encoder-Decoders." EMNLP, 2016.
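For illustration, here's a minimal PyTorch-style sketch of that idea (not the paper's exact architecture): the bucketed extra feature gets its own learned embedding, which is concatenated with a pooled text representation before the output head. All names, sizes, and the GRU stand-in encoder are placeholders.

```python
# Sketch: jointly learn an embedding for a bucketed extra feature and
# concatenate it with the text representation. Placeholder sizes/encoder.
import torch
import torch.nn as nn

class TextWithFeatureModel(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, n_buckets=10, n_classes=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)  # stand-in text encoder
        self.feature_emb = nn.Embedding(n_buckets, 32)           # learned embedding for the extra feature
        self.head = nn.Linear(hidden + 32, n_classes)

    def forward(self, input_ids, feature_bucket):
        x = self.token_emb(input_ids)
        _, h = self.encoder(x)            # h: (1, batch, hidden)
        text_repr = h.squeeze(0)          # pooled text representation
        feat_repr = self.feature_emb(feature_bucket)
        return self.head(torch.cat([text_repr, feat_repr], dim=-1))

# toy usage
model = TextWithFeatureModel()
input_ids = torch.randint(0, 30522, (4, 16))   # batch of 4 token-id sequences
feature_bucket = torch.randint(0, 10, (4,))    # bucketed extra feature per example
logits = model(input_ids, feature_bucket)
print(logits.shape)  # torch.Size([4, 2])
```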
However, with Transformers you can usually concatenate the extra features directly with the input text, e.g. `0.8 <sep> this is the input text`. Keep in mind that a Transformer doesn't really understand raw numbers; they are just tokens to it. So you should bucket the numbers into a relatively small set of unique values so the model can learn the association between each bucket and the prediction.
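As a concrete example, here's a tiny sketch of that bucket-then-prepend step; the bucket boundaries, label strings, and `<sep>` separator are assumptions you'd adapt to your data and tokenizer (e.g. use your tokenizer's own special token).

```python
# Sketch of "bucket then prepend". Bucket boundaries, labels, and the
# separator string are assumptions; adapt them to your tokenizer.
def bucket_score(score: float, boundaries=(0.2, 0.4, 0.6, 0.8)) -> str:
    """Map a continuous score to one of a small number of bucket labels."""
    for i, b in enumerate(boundaries):
        if score < b:
            return f"score_bucket_{i}"
    return f"score_bucket_{len(boundaries)}"

def build_input(score: float, text: str, sep: str = "<sep>") -> str:
    """Prepend the bucketed feature so the model sees it as ordinary tokens."""
    return f"{bucket_score(score)} {sep} {text}"

print(build_input(0.8, "this is the input text"))
# -> "score_bucket_4 <sep> this is the input text"
```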
If you're looking for a more complex architecture, take a look at the following paper:
Moreira, Gabriel de Souza P., et al. “Transformers with multi-modal features and post-fusion context for e-commerce session-based recommendation.” arXiv preprint arXiv:2107.05124 (2021).
Note: I'm speaking from my experience with language generation, but I think a classification setup should be able to use these features in a similar way.