Concatenate non-string features to a BERT transformers model

I’m trying to do a sentiment analysis problem: I have text that I want to classify as negative or positive, but I also have metadata surrounding the text (the user’s past sentiments, etc.) that I’d like to incorporate into the training data.

Ideally I’d like to append these values to the last high-dimensional output of BERT, before the classifier head gets slapped on. Are there examples of this anywhere I can crib off of? I have the base distilbert example running; I’m just trying to see how I can build on it.


(If it matters, I’m using keras/tensorflow under the hood)

Hi @eddieparker,

in the pre-transformer era it was common to build customized models that jointly learn embeddings of these extra features and concatenate them with the input text representation. A representative paper is the following:

Kikuchi, Yuta, et al. “Controlling Output Length in Neural Encoder-Decoders.” EMNLP 2016.

However, with transformers you can usually concatenate the extra features directly with the input text, e.g. “0.8 <sep> this is the input text”. It’s important to know that transformers don’t understand numbers, so you should bucket the numbers into a relatively small list of unique values so that the transformer can learn the association between the feature and the prediction.
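For example, a minimal sketch of that idea (the bucket edges, the [BUCKET_i] tokens and the past_sentiment feature are all made up for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Hypothetical bucket tokens: treat the metadata value as a small vocabulary
# of categories rather than a raw float.
BUCKET_TOKENS = ["[BUCKET_0]", "[BUCKET_1]", "[BUCKET_2]", "[BUCKET_3]"]
tokenizer.add_tokens(BUCKET_TOKENS)

def bucket(value, edges=(0.25, 0.5, 0.75)):
    """Map a float in [0, 1] to one of the bucket tokens."""
    for i, edge in enumerate(edges):
        if value < edge:
            return BUCKET_TOKENS[i]
    return BUCKET_TOKENS[-1]

past_sentiment = 0.8  # made-up metadata feature
text = "this is the input text"
augmented = f"{bucket(past_sentiment)} {tokenizer.sep_token} {text}"
# -> "[BUCKET_3] [SEP] this is the input text"

encoded = tokenizer(augmented, return_tensors="tf")
# If you add new tokens, remember to resize the model's embedding matrix:
# model.resize_token_embeddings(len(tokenizer))
```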

If you’re looking for some more complex architecture, take a look at the following paper:

Moreira, Gabriel de Souza P., et al. “Transformers with multi-modal features and post-fusion context for e-commerce session-based recommendation.” arXiv preprint arXiv:2107.05124 (2021).

Note: I’m speaking from my experience on language generation, but I think the classification problem should be able to use the features in a similar way.

Oh that’s a neat idea - and thanks for the response!

A couple of questions/clarifications:

When you write <sep>, is that literally the string “<sep>”? Or are you indicating I should pick some sort of arbitrary separator, so long as it’s consistent?

And if I understand your comment that “transformers don’t understand numbers”: the tokenizer sees 0.8 as the characters ‘0’, ‘.’, and ‘8’ rather than as a float with value 0.8, so it’s in my best interest to have discrete ‘buckets’ of values so BERT can learn the semantic difference between a few ‘categories’ rather than a high-precision float?
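E.g., I’m guessing a quick check like this would show the split (I haven’t verified the exact pieces):

```python
# Sanity check: how does the tokenizer actually see a raw float?
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tokenizer.tokenize("0.8"))  # something like ['0', '.', '8'], not a number
```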

If so I think I get it. Thanks I’ll chew on that.

My original thought was to set num_labels to the number of hidden states, then merge that with another net that takes my float and categorical values, add a few hidden layers/dropout, and finally condense to my real number of labels (roughly like the sketch below). Do you have any thoughts on how that would work out?
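Concretely, something like this (a sketch only; the sequence length, the number of metadata features and the layer sizes are placeholders I made up):

```python
import tensorflow as tf
from transformers import TFDistilBertModel

NUM_LABELS = 2         # positive / negative
NUM_META_FEATURES = 4  # e.g. the user's past sentiment stats (made up)
MAX_LEN = 128

bert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")
meta_features = tf.keras.Input(shape=(NUM_META_FEATURES,), dtype=tf.float32, name="meta")

# Take the last hidden state of the [CLS] token as the text representation.
hidden = bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
cls_vector = hidden[:, 0, :]  # shape: (batch, 768)

# Concatenate the metadata with the text representation, then a small head.
merged = tf.keras.layers.Concatenate()([cls_vector, meta_features])
x = tf.keras.layers.Dense(256, activation="relu")(merged)
x = tf.keras.layers.Dropout(0.2)(x)
logits = tf.keras.layers.Dense(NUM_LABELS)(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask, meta_features], outputs=logits)
model.compile(
    optimizer=tf.keras.optimizers.Adam(3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```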

Thank you!

-e-

“Transformers doesn’t understand numbers”

By this I mean it doesn’t matter whether you put {0.2, 0.4, 0.6} or {A, B, C}. The model just learns that each value is a different category.

Of course, you can use a separate small feedforward net to learn the category embeddings. It would be very similar to the EMNLP 2016 paper I mentioned above, and some papers take that approach with transformers as well. However, my experience is that transformers are much more versatile and powerful, and conditional training with bucketing is usually enough, so I wouldn’t go for a customized architecture in practical applications.

The main advantage of using a separate network to learn the features is that you can feed continuous values (e.g., 0.84) instead of discrete ones. If that’s important to your application, you may consider using it instead.
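For completeness, the feature side of such a network can be tiny. Something like this (layer sizes are arbitrary), whose output you would then concatenate with the text representation:

```python
import tensorflow as tf

# Option A: bucketed feature -> learned category embedding
bucket_id = tf.keras.Input(shape=(1,), dtype=tf.int32, name="bucket_id")
bucket_emb = tf.keras.layers.Embedding(input_dim=4, output_dim=8)(bucket_id)
bucket_emb = tf.keras.layers.Flatten()(bucket_emb)

# Option B: raw continuous value (e.g. 0.84) -> small dense projection
raw_value = tf.keras.Input(shape=(1,), dtype=tf.float32, name="raw_value")
value_vec = tf.keras.layers.Dense(8, activation="relu")(raw_value)
```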