I’ve often thought about use cases where you have word or sentence features that you know must be helpful to the system: features that you would typically use in an SVM or a shallow network. I’d like to know whether those features can still improve the performance of a pretrained language model. So rather than just fine-tuning the language model, what are good ways to integrate custom features into an LM without pretraining from scratch?
My guess is that you can just take the output from an LM and add a custom head on top that also takes in these other features, so the output of the LM basically serves as another set of features. This does not seem ideal though, since the final connections might be too shallow. I imagine a better approach is possible that still involves fine-tuning the LM alongside training the network that the custom features feed into. Any thoughts or “tried and true” methods out there?
One of my students studied exactly this phenomenon in a recent submission to SemEval: “UoB at SemEval-2020 Task 12: Boosting BERT with Corpus Level Information.” (https://arxiv.org/abs/2008.08547)
Excerpts from the paper:
We hypothesise that deep learning models, especially those that use pre-trained embeddings and so are trained on a small number of epochs, can benefit from corpus level count information. We test this on Sub-Task A using an ensemble of BERT and TF-IDF which outperforms both the individual models.
For sub-task B, we hypothesise that these sentence representations can benefit from having POS information to help identify the presence of a target. To test this hypothesis, we integrate the count of part-of-speech (POS) tags with BERT. While this combination did outperform BERT, we found that a simpler modification to BERT (i.e. cost weighting, Section 3.5) outperforms this combination.
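As an aside for readers unfamiliar with the term: the excerpt does not spell out how the cost weighting in Section 3.5 is done, so the following is only a minimal sketch of one common form, assuming inverse-frequency class weights applied to a cross-entropy loss. The function names and the toy labels and probabilities are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes cost more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the summed weights."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Toy, imbalanced data (hypothetical): three examples of class 0, one of class 1.
labels = [0, 0, 0, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]  # predicted class probabilities

weights = inverse_frequency_weights(labels)
plain = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(probs, labels, weights)
```

With these toy numbers the poorly predicted minority example counts for more, so the weighted loss is larger than the unweighted one. In PyTorch, class weights like these can be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.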
And in terms of how the model was built:
This ensemble model is created by concatenating the sentence representation of BERT to the features generated by the TF-IDF model before then using this combined vector for classification. In practice, this translates into calculating the TF-IDF vector for each sentence and concatenating it to the corresponding BERT output. This vector is then fed to a fully connected classification layer. Both BERT and the TF-IDF weights are updated during training.
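The construction described in this excerpt can be sketched roughly as follows. This is a minimal PyTorch sketch and not the paper's actual code: `ConcatFeatureClassifier` and `ToyEncoder` are hypothetical names, the toy encoder (an embedding layer with mean pooling) only stands in for BERT so that the example runs without downloading a model, and the random tensor stands in for real TF-IDF vectors.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Stand-in for a pretrained LM: embeds tokens and mean-pools over non-padding positions."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask):
        hidden = self.embed(input_ids)                         # (batch, seq, hidden)
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)   # (batch, seq, 1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


class ConcatFeatureClassifier(nn.Module):
    """Concatenate the sentence representation with extra features, then classify."""

    def __init__(self, encoder, hidden_size, num_features, num_classes):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size + num_features, num_classes)

    def forward(self, input_ids, attention_mask, features):
        sentence_repr = self.encoder(input_ids, attention_mask)  # (batch, hidden)
        combined = torch.cat([sentence_repr, features], dim=-1)  # (batch, hidden + num_features)
        return self.classifier(combined)


torch.manual_seed(0)
model = ConcatFeatureClassifier(ToyEncoder(100, 16), hidden_size=16,
                                num_features=50, num_classes=2)

input_ids = torch.randint(0, 100, (4, 12))             # fake token ids
attention_mask = torch.ones(4, 12, dtype=torch.long)   # no padding in this toy batch
tfidf = torch.rand(4, 50)                              # placeholder for TF-IDF vectors
labels = torch.tensor([0, 1, 0, 1])

logits = model(input_ids, attention_mask, tfidf)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()  # gradients flow into both the encoder and the classification layer
```

In practice you would swap the toy encoder for a pretrained model that returns its sentence representation. Because the concatenation happens inside the module, a single loss updates the encoder and the final layer together.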
Have you solved this problem? I have a similar need.
We’ve since built on the previous work in the paper “Incorporating Count-Based Features into Pre-Trained Models for Improved Stance Detection” (https://arxiv.org/pdf/2010.09078.pdf). The code for this work is available at https://github.com/Anushka-Prakash/RumourEval-2019-Stance-Detection/
This work outperforms a RoBERTa baseline and achieves state-of-the-art results in stance detection by addressing these problems (from the paper):
- Pre-trained models, such as BERT, are often trained for between 2 and 5 epochs during fine-tuning whereas simpler feature based models need to be trained for much longer. Our experiments show that a simple ensemble of these models results in over-fitting
- There are likely to be too many features to directly ensemble the raw features with pre-trained models (resulting in too much noise), a loss of important - task specific - information when using dimensionality reduction methods, and too few output classes to use only the outputs of a feature based model in an ensemble (lack of information).