Text-to-feature FinBERT for regression

I need to build a feature extractor for a project that translates a given financial statement (text) into a vector I can use as features in my main problem. I am currently doing revenue forecasting: I use historical fundamentals data together with stock prices to predict revenue growth for the next quarter (a regression problem). In addition, I have text data (financial statements), and I want to use BERT to derive new features from it for my regression model. That is, the vector from the BERT feature extraction will later be combined with several other values (fundamentals and stock price data) for the final prediction (next-quarter revenue growth) in, e.g., a random forest or XGBoost model.

I want to try both a fine-tuned FinBERT model and a pre-trained FinBERT model and compare them. But how do I fine-tune FinBERT on my dataset (a regression problem) and then use the fine-tuned model for feature extraction?


You can easily get a feature vector for a given piece of text as follows:

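import torch
from transformers import BertTokenizer, BertModel

# load the tokenizer and the encoder from the same checkpoint
tokenizer = BertTokenizer.from_pretrained("ProsusAI/finbert")
model = BertModel.from_pretrained("ProsusAI/finbert")

text = "hello world"
# truncate long inputs to BERT's 512-token limit
encoding = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

# forward pass (no gradients needed for feature extraction)
with torch.no_grad():
    outputs = model(**encoding)

# get feature vector: final hidden state of the [CLS] token
feature_vector = outputs.last_hidden_state[:, 0, :]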

Here I’m taking the final hidden state of the [CLS] token, which is commonly used as a representation of an entire piece of text. For BERT-base-sized models this is a vector of 768 values per input, so feature_vector has shape (1, 768) for a single text.
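
For more than one document, the same idea can be wrapped in a small helper. This is only a sketch: the function name extract_cls_features and its batch_size and max_length parameters are my own choices for illustration, not part of the transformers API, and it assumes that truncating each statement to 512 tokens is acceptable for your data.

```python
import torch


def extract_cls_features(texts, tokenizer, model, batch_size=8, max_length=512):
    """Return one [CLS] vector per text as a (len(texts), hidden_size) tensor.

    Hypothetical helper: `tokenizer` and `model` are expected to be a matching
    Hugging Face tokenizer / BertModel pair loaded by the caller.
    """
    model.eval()  # make sure dropout is off so features are deterministic
    chunks = []
    with torch.no_grad():  # feature extraction needs no gradients
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            encoding = tokenizer(
                batch,
                padding=True,          # pad to the longest text in the batch
                truncation=True,       # cut texts longer than max_length
                max_length=max_length,
                return_tensors="pt",
            )
            outputs = model(**encoding)
            # position 0 of every sequence is the [CLS] token
            chunks.append(outputs.last_hidden_state[:, 0, :])
    return torch.cat(chunks, dim=0)
```

With the tokenizer and model loaded as above, extract_cls_features(list_of_statements, tokenizer, model) would give one row per statement; calling .numpy() on the result yields an array that can be passed to a tabular model.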

I’m not sure what you mean by fine-tuning a BERT model on your dataset. You can fine-tune BERT on a regression problem, where the inputs are text and the outputs are floats. Is this what you want to do?
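
If that is the goal, one possible route (a sketch, not the only way) is to load the checkpoint into BertForSequenceClassification with num_labels=1, in which case the model computes a mean-squared-error loss against float labels, fine-tune it, and afterwards read the [CLS] state from the fine-tuned encoder under model.bert. The helper names below are hypothetical. I am also assuming that the ProsusAI/finbert checkpoint ships a 3-class sentiment head, which is why ignore_mismatched_sizes=True is passed so that a fresh one-output head can be initialised.

```python
import torch
from transformers import BertForSequenceClassification


def load_regression_model(checkpoint="ProsusAI/finbert"):
    """Load the encoder with a new single-output (regression) head.

    Assumption: the checkpoint's own head has a different size, so it is
    discarded and re-initialised via ignore_mismatched_sizes=True.
    """
    return BertForSequenceClassification.from_pretrained(
        checkpoint, num_labels=1, ignore_mismatched_sizes=True
    )


def regression_train_step(model, optimizer, batch, targets):
    """Run one gradient step and return the loss as a Python float.

    `batch` is a tokenizer encoding (input_ids, attention_mask, ...);
    `targets` is a float tensor with one value per example.
    """
    model.train()
    optimizer.zero_grad()
    outputs = model(**batch, labels=targets)  # MSE loss when num_labels == 1
    outputs.loss.backward()
    optimizer.step()
    return outputs.loss.item()


def encoder_cls_features(model, batch):
    """[CLS] features from the encoder inside the fine-tuned model."""
    model.eval()
    with torch.no_grad():
        return model.bert(**batch).last_hidden_state[:, 0, :]
```

Running encoder_cls_features once on a model you have fine-tuned and once on a freshly loaded one gives the two feature sets to compare. How many epochs to train and which learning rate to use are left open here and would need tuning on your data.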


Thanks! I want to do something similar to what is discussed in issue #1323 of huggingface/transformers on GitHub, "How to build a Text-to-Feature Extractor based on Fine-Tuned BERT Model".

But instead of using the text features as input to a classification problem, I want to use them as input to a regression problem.
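
To make the intended last step concrete, here is a minimal sketch of feeding text features into a regressor together with tabular features. Everything in it is an assumption for illustration: the data is random placeholder data rather than real financials, the 768-wide text block stands in for FinBERT [CLS] vectors, combine_features is a hypothetical helper, and scikit-learn's RandomForestRegressor is used as one example of a tree-based model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def combine_features(text_features, tabular_features):
    """Concatenate per-sample text vectors with tabular features column-wise."""
    text_features = np.asarray(text_features, dtype=np.float32)
    tabular_features = np.asarray(tabular_features, dtype=np.float32)
    if text_features.shape[0] != tabular_features.shape[0]:
        raise ValueError("both inputs need one row per sample")
    return np.hstack([text_features, tabular_features])


# Synthetic placeholder data, NOT real financials.
rng = np.random.default_rng(0)
text_features = rng.normal(size=(40, 768))  # stand-in for FinBERT [CLS] vectors
fundamentals = rng.normal(size=(40, 5))     # stand-in for fundamentals / price data
revenue_growth = rng.normal(size=40)        # stand-in regression target

X = combine_features(text_features, fundamentals)

# Fit on the first 30 samples and predict the remaining 10.
regressor = RandomForestRegressor(n_estimators=50, random_state=0)
regressor.fit(X[:30], revenue_growth[:30])
predictions = regressor.predict(X[30:])
```

With real data the split should respect time order (train on earlier quarters, evaluate on later ones), and the 768 text columns may outweigh a handful of fundamentals, so reducing the text vector's dimensionality first could be worth trying.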