Can an encoder model be trained on a concatenation of BertModel [CLS] embeddings and additional input features using the transformers library?

Hello dear HuggingFace community,

I am currently trying to build a BertForSequenceClassification-style model that takes the [CLS] vector of the BertModel output as input, together with additional features (about 50 dimensions). I concatenate the extra features with the [CLS] embedding and now need to train an encoder on top to learn non-linear relationships in the combined input. The goal is to classify text blocks on web pages, using both the text of a block and its tag path, represented as a sparse vector.
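For reference, here is a minimal sketch of the architecture I have in mind (the class name, hidden sizes, and MLP head are my own choices, not anything from the library, so please correct me if there is a more idiomatic way):

```python
import torch
import torch.nn as nn
from transformers import BertModel


class BertWithExtraFeatures(nn.Module):
    """Concatenate the BERT [CLS] embedding with extra features and
    classify with a small MLP head (hypothetical sketch)."""

    def __init__(self, bert: BertModel, num_labels: int, extra_dim: int = 50):
        super().__init__()
        self.bert = bert
        hidden = self.bert.config.hidden_size  # 768 for bert-base
        self.classifier = nn.Sequential(
            nn.Linear(hidden + extra_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_labels),
        )

    def forward(self, input_ids, attention_mask, extra_features):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # [CLS] token embedding is the first position of the last hidden state
        cls = outputs.last_hidden_state[:, 0]
        combined = torch.cat([cls, extra_features], dim=-1)
        return self.classifier(combined)


# Intended usage (downloads the pretrained weights):
# model = BertWithExtraFeatures(
#     BertModel.from_pretrained("bert-base-uncased"), num_labels=5
# )
```

The idea would then be to train this end to end with a standard cross-entropy loss, but I am not sure whether this is the recommended pattern.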

Is it possible to feed such a concatenated input to the encoder models in the library? If so, could anyone share a snippet or a link showing how to implement and train it?

(Edit: Alternatively, if the encoder models provided by the HuggingFace transformers library cannot take such input, how could I feed it to a BiLSTM with a classification head on top?)
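For the BiLSTM alternative, this is roughly what I would try (again a hypothetical sketch; the embedding layer, hidden size, and head are my own assumptions): run a BiLSTM over the token embeddings, take the final forward and backward hidden states, and concatenate them with the extra features before the classification head.

```python
import torch
import torch.nn as nn


class BiLSTMClassifier(nn.Module):
    """BiLSTM over token embeddings; final hidden states concatenated
    with extra features, then a linear classification head (sketch)."""

    def __init__(self, embeddings: nn.Embedding, extra_dim: int,
                 num_labels: int, lstm_hidden: int = 128):
        super().__init__()
        self.embed = embeddings  # e.g. a (possibly frozen) nn.Embedding
        self.lstm = nn.LSTM(self.embed.embedding_dim, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * lstm_hidden + extra_dim, num_labels)

    def forward(self, input_ids, extra_features):
        x = self.embed(input_ids)                   # (batch, seq, emb_dim)
        _, (h_n, _) = self.lstm(x)                  # h_n: (2, batch, hidden)
        # concatenate the final forward and backward hidden states
        h = torch.cat([h_n[0], h_n[1]], dim=-1)     # (batch, 2 * hidden)
        return self.head(torch.cat([h, extra_features], dim=-1))
```

I am unsure whether the final hidden states or a pooling over all timesteps works better here, so any advice would be appreciated.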

Thank you very much in advance!