Fine-tuning for feature extraction? I.e. unsupervised fine-tuning?

I noticed the facebook/bart-large-mnli model card on the Hugging Face Hub doesn’t show the feature-extraction task under the Train menu, but it does appear under the Deploy menu. I haven’t been able to find an example of fine-tuning a feature-extraction model, so is fine-tuning not an option for feature extraction tasks? If it is, I’d love to see a working example somewhere…the examples I’ve been able to find are all for supervised learning, which makes me wonder if one needs labelled data to do fine-tuning?

Hey @MaximusDecimusMeridi, the term “feature extraction” usually means to extract or “pool” the last hidden states from a pretrained model. So fine-tuning a model for feature extraction is equivalent to fine-tuning the language model, e.g. via masked or autoregressive language modelling. (You can find a BERT-like example of fine-tuning here, and indeed one does not need any labelled data for this step).
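To make the “no labels needed” point concrete, here is a toy sketch (plain Python, with made-up token ids; the 15% masking rate and the `-100` ignore value mirror common Hugging Face defaults, and `MASK_ID = 103` is only borrowed from BERT for illustration) of how masked language modelling manufactures its own labels from raw text:

```python
import random

MASK_ID = 103   # hypothetical [MASK] token id (103 happens to be BERT's)

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Hide ~15% of the tokens; the hidden originals become the training labels."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # the model sees [MASK]...
            labels.append(tok)       # ...and must predict the original token
        else:
            inputs.append(tok)
            labels.append(-100)      # -100 = position ignored by the loss
    return inputs, labels

inputs, labels = mask_tokens(list(range(1000, 1020)))
```

The labels come entirely from the text itself, which is why this step works on a raw, unlabelled corpus.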

For BART, the situation is a bit more complex because it is a seq2seq architecture, so you would likely need to frame your fine-tuning task in that manner (e.g. as a translation or summarization task).

Most applications that need feature extraction (e.g. neural search) perform best with encoder-based models like BERT and friends - I recommend checking out sentence-transformers (link) which provides many state-of-the-art models for these applications :slight_smile:
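To illustrate what “pool the last hidden states” means in practice, here is a minimal NumPy sketch (random numbers stand in for a model’s hidden states, so the values themselves are meaningless): mean pooling that ignores padding positions, plus the cosine similarity typically used to compare the resulting embeddings in neural search:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings, skipping padding positions.
    hidden_states: (seq_len, dim); attention_mask: (seq_len,) of 0/1."""
    mask = attention_mask[:, None]            # broadcast mask over the hidden dim
    return (hidden_states * mask).sum(axis=0) / mask.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 8))                   # 6 tokens, 8-dim hidden states
mask = np.array([1, 1, 1, 1, 0, 0])           # last two positions are padding
emb = mean_pool(h, mask)                      # one fixed-size vector per text
sim = cosine(emb, emb)                        # identical vectors -> similarity 1.0
```

The pooled vector is the “feature” in feature extraction; libraries like sentence-transformers do this pooling (and more) for you.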


Thank you, I am reading through the book now and it is fantastic. Found the feature extraction part very clear and helpful :+1:

@lewtun I have a question about this. So, trying to summarize: if we take a model like DistilBERT for the purposes of extracting embeddings as features for a downstream task, I have three options:

  1. Use the pretrained model out of the box with no fine-tuning (as in chapter 2, “Transformers as Feature Extractors”)
  2. Fine-tune the model as a masked language model on the unlabelled corpus to adapt it to the domain (as in Fine-tuning a masked language model)
  3. Fine-tune the model on the labelled corpus, to ensure the embeddings encapsulate the information needed for separating the classes

And from each of these I can extract embeddings, and then use them as features in any classification task I need.

In my particular case I have ~1M unlabelled text descriptions, and about 50K labelled descriptions. Would it make sense to do all three here? I.e. get the pretrained model, expose it to the domain vocabulary as much as possible (MLM fine-tuning), and then tune for classification (SequenceClassification fine-tuning)? Would either be more likely to have an impact than the other in this case (MLM fine-tuning vs classification fine-tuning)? Would the order matter? I realize the answer may be to test and evaluate, but I’m curious if there are any rules of thumb.

> Apply traditional feature engineering. Besides using the embeddings from Transformer models, we could also add features such as the length of the tweet or whether certain emojis or hashtags are present.

Are there any resources on how best to do this? Sorry if this is covered in the second half of the book! I am finding it very helpful so far :+1: :+1: Would the approach be like the sklearn example, where the embeddings are read into a dataframe and I just add additional features to it and then train a classifier? I’m curious how scaling etc. is handled when combining other features with embeddings like that.
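For what it’s worth, a minimal sketch of that workflow (random vectors stand in for the extracted embeddings, and the two extra features are invented for illustration): concatenate the extra columns onto the embedding matrix and let a scaler reconcile the very different ranges before the classifier sees them:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n, dim = 200, 16

embeddings = rng.normal(size=(n, dim))          # stand-in for pooled hidden states
tweet_len = rng.integers(5, 280, size=(n, 1))   # handcrafted feature: tweet length
has_hashtag = rng.integers(0, 2, size=(n, 1))   # handcrafted feature: hashtag flag

# Concatenate embeddings and extra columns; StandardScaler then puts the
# raw counts and the embedding dimensions on the same footing.
X = np.hstack([embeddings, tweet_len, has_hashtag])
y = (embeddings[:, 0] + 0.01 * tweet_len[:, 0] > 1.4).astype(int)  # toy labels

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
acc = clf.score(X, y)
```

Because the toy labels are a linear function of the concatenated columns, the pipeline fits them easily; the point is just the mechanics of concatenation plus scaling inside one sklearn pipeline.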

Also, it feels like the sklearn classifier is doing the same job as the final classification layer in a network. Would there be a way to achieve this in a single neural network, instead of training one network, extracting embeddings from it, and then using them to train a separate classifier model? So in the example from the book, one could add the tweet length to the embedding vector somehow and still treat it as a regular SequenceClassification problem? That would avoid having to optimize them separately/sequentially. Thanks again for all the great resources here :pray:
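One way to get a single network is exactly that concatenation, sketched below as a plain-NumPy forward pass (the shapes and the two handcrafted features are invented for illustration): mean-pool the encoder output, append the extra features, and feed everything to one classification head, so in a real implementation the head and encoder could be trained end to end:

```python
import numpy as np

rng = np.random.default_rng(0)

def classify(hidden_states, extra_features, W, b):
    """Forward pass of a classification head that also sees handcrafted features.
    hidden_states: (seq_len, dim) encoder output; extra_features: (k,)."""
    pooled = hidden_states.mean(axis=0)            # mean-pool the token embeddings
    features = np.concatenate([pooled, extra_features])
    return features @ W + b                        # linear head over (dim + k) inputs

dim, k, n_classes = 8, 2, 3
W = rng.normal(size=(dim + k, n_classes))          # head weights (would be learned)
b = np.zeros(n_classes)

h = rng.normal(size=(5, dim))                      # 5 tokens of encoder output
extra = np.array([42.0, 1.0])                      # e.g. tweet length, hashtag flag
logits = classify(h, extra, W, b)
```

The only change from a stock SequenceClassification head is that the linear layer takes `dim + k` inputs instead of `dim`.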

Hi Maximus, I thought I’d try to answer your question.

#1 is very computationally expensive and increasingly unnecessary. The question is how similar your data is to the data the model was trained on. But even for domains quite different from BERT’s original training data, e.g. movie reviews, fine-tuning can deliver more than sufficient results: https://arxiv.org/pdf/1905.05583.pdf

In addition, there are tons of models on the Hugging Face Hub that have already been fine-tuned, possibly on data similar to yours, which you should check out before you consider pre-training the entire transformer yourself, even if you have those resources available, imo.

#2 doesn’t really make sense to me. Fine-tuning implies adding a supervised layer to make the otherwise unsupervised model learn the task you want. In this process, the same pretrained layers stay active and the self-attention mechanism still operates, just now on your data, with an added component to minimize the loss between its outputs and your labels. So I think maybe #2 and #3 are the same thing.

W/r/t the question of adding features: you have it correct. You can’t add features to the transformer itself and try to fine-tune, because it is a language model. Put another way, it just doesn’t make sense to try to predict the masked token in a sentence if the masked token is the length of the sentence or whatever. But yes, you can take the trained model and use it to encode your data, then add a dimension (or however many you want) for extra features, and hand the result to some other model that will do something with them: a regression, GBM, or SVM with sklearn. Oh, and yes, I would normalize all of your features for such a model.

Hope that helps!


> #1 is very computationally expensive and increasingly unnecessary.

Did you mean training from scratch here? I was referring to using the pretrained model without any fine-tuning, so just download and apply DistilBERT, for instance. So that would be the minimum amount of work?

> So I think maybe #2 and #3 are the same thing.

The big difference here is the amount of labelled vs unlabelled data (1M vs 50K). If I limit myself to tuning on labelled data, the masking process will have a much smaller corpus to learn from. This is why I would want to do #2 first on the 1M unlabelled records, which should give the model a much stronger grasp of the domain’s general language. Then I would use the 50K labelled records to fine-tune further, specific to the labels. In a nutshell, if I skip #2 the model will have seen far less of my domain’s language than if I include it.
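For what it’s worth, that two-stage recipe maps onto the Transformers API roughly as follows. This is a non-runnable outline, not a complete script: dataset loading, tokenization, and training arguments are elided with `...`, and the checkpoint names are only examples.

```python
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling, Trainer)

# Stage 1: domain adaptation on the ~1M unlabelled texts
# (MLM — the labels come from the masking itself, no annotation needed)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mlm_model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
Trainer(model=mlm_model, data_collator=collator, train_dataset=...).train()
mlm_model.save_pretrained("distilbert-domain-adapted")

# Stage 2: supervised fine-tuning on the ~50K labelled texts,
# starting from the domain-adapted checkpoint of stage 1
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-domain-adapted", num_labels=...)
Trainer(model=clf_model, train_dataset=...).train()
```

The ordering in the sketch matches the reasoning above: domain adaptation first, then the supervised head is trained on top of the adapted checkpoint.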