Training classifier with frozen DistilBERT embeddings

Hello everyone,

I’m trying to do some sentiment analysis on the IMDB movie reviews dataset. First I trained a model based on GloVe embeddings followed by an LSTM layer and a fully connected feedforward layer. I implemented it with PyTorch and it works like a charm. Now I’m trying to replace the GloVe + LSTM part with a transformer-based model. I managed to do it, and I chose DistilBERT as it is supposed to be lightweight (I’m training the model on my laptop, which has no GPU). I kept the DistilBERT embeddings frozen, again to minimize the computational cost. But the results are really bad: it basically looks like the model is oscillating around random guessing.

Here are a few metrics after 1 epoch:

Accuracy = 0.5201
F1 score binary = 0.6723
Recall score = 0.9898
Average precision score = 0.5090
Area under the ROC = 0.7421

Loss on validation set: 1040.9420166015625

Here are the same metrics after 5 epochs:

Accuracy = 0.6855
F1 score binary = 0.6725
Recall score = 0.6493
Average precision score = 0.6974
Area under the ROC = 0.7497

Loss on validation set: 480.90106201171875

and here are the same metrics after 11 epochs (after that I stopped):

Accuracy = 0.5939
F1 score binary = 0.3630
Recall score = 0.2327
Average precision score = 0.8251
Area under the ROC = 0.7330

Loss on validation set: 738.0289306640625

So not the magical bump in accuracy I was hoping for :sweat_smile:

Here is also the model I used:

import torch.nn as nn
from transformers import DistilBertModel


class DistilBERTWithPoolingClf(nn.Module):
    """
    Classifier based on the Hugging Face Transformers implementation of DistilBERT,
    using a basic DistilBERT layer with max pooling on top of it.
    """

    __name__ = "DistilBERTbase"

    def __init__(self, keep_prob, seq_length):
        super(DistilBERTWithPoolingClf, self).__init__()
        self.DistilBERT = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.DistilBERT.requires_grad_(False)  # Embeddings are frozen
        self.maxpool = nn.MaxPool1d(seq_length)
        self.dropout = nn.Dropout(1 - keep_prob)
        self.hidden2bin = nn.Linear(768, 2)  # 768 = DistilBERT hidden size

    def forward(self, ids, mask, token_type_ids):
        batch_size = ids.shape[0]
        # Unlike BERT (Hugging Face implementation), the forward method returns
        # the embedding of every input token and there is no embedding of the
        # CLS token (as far as I know)
        hidden = self.DistilBERT(ids, attention_mask=mask, return_dict=False)
        hidden = hidden[0]                # (batch, seq_length, 768)
        hidden = hidden.permute(0, 2, 1)  # (batch, 768, seq_length)
        hidden = self.maxpool(hidden)     # (batch, 768, 1)
        hidden = self.dropout(hidden)
        logits = self.hidden2bin(hidden.view(batch_size, 768))
        return logits
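
To make the expected input shapes concrete, here is a minimal usage sketch (the tokenizer call, seq_length and keep_prob values are just placeholders; the important part is that every review gets padded to exactly seq_length, which nn.MaxPool1d(seq_length) relies on):

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
seq_length = 256  # placeholder; must match the value passed to the classifier
clf = DistilBERTWithPoolingClf(keep_prob=0.8, seq_length=seq_length)
clf.eval()  # disable dropout for a deterministic sanity check

enc = tokenizer(
    ["A surprisingly good movie.", "Two hours of my life I will never get back."],
    padding="max_length", truncation=True, max_length=seq_length, return_tensors="pt",
)
# DistilBERT has no token type embeddings, so the third argument is unused
logits = clf(enc["input_ids"], enc["attention_mask"], None)
print(logits.shape)  # torch.Size([2, 2])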

I have a few questions:

  1. Am I naive in trying to use these models without a GPU? I was hoping that by freezing them, I would only have to train the feedforward layer, and that this part would be computationally feasible.
  2. Could it be that the results are that bad because the DistilBERT layer is frozen?
  3. Do you see some obvious mistake in my definition of the model? I am a complete beginner when it comes to Hugging Face Transformers, so I wouldn’t be surprised if there were one.
  4. Any other suggestions?

Hi @abercher, regarding your questions:

  1. You can certainly use Transformers as feature extractors on a CPU: since the weights are frozen, you only need the forward pass, which is relatively quick to compute.
  2. My experience has generally been that you can get significantly worse results when using the last hidden states as features vs fine-tuning end-to-end (in some cases > 20 F1 points!). But this varies depending on the dataset / task, so I am not sure if it’s also true for IMDB.
  3. One thing that seems a bit odd is the hidden.permute(0, 2, 1) part of your forward pass - why do you do this?
  4. You might get better results by using the average of the unmasked hidden states, e.g. in numpy code:
  input_ids = torch.tensor(batch["input_ids"]).to(device)
  attention_mask = torch.tensor(batch["attention_mask"]).to(device)
  with torch.no_grad():
    last_hidden_state = model(input_ids, attention_mask).last_hidden_state
    last_hidden_state = last_hidden_state.cpu().numpy()
  # Use average of unmasked hidden states for classification
  lhs_shape = last_hidden_state.shape
  boolean_mask = ~np.array(batch["attention_mask"]).astype(bool)
  boolean_mask = np.repeat(boolean_mask, lhs_shape[-1], axis=-1)
  boolean_mask = boolean_mask.reshape(lhs_shape)
  masked_mean = np.ma.array(last_hidden_state, mask=boolean_mask).mean(axis=1)
  batch["hidden_state"] = masked_mean.data

You could also see what the performance looks like with something simpler like logistic regression, e.g. first extract the features:

def forward_pass(batch):
  input_ids = torch.tensor(batch["input_ids"]).to(device)
  attention_mask = torch.tensor(batch["attention_mask"]).to(device)
  with torch.no_grad():
    last_hidden_state = model(input_ids, attention_mask, return_dict=True).last_hidden_state
    last_hidden_state = last_hidden_state.cpu().numpy()
  # Use average of unmasked hidden states for classification
  lhs_shape = last_hidden_state.shape
  boolean_mask = ~np.array(batch["attention_mask"]).astype(bool)
  boolean_mask = np.repeat(boolean_mask, lhs_shape[-1], axis=-1)
  boolean_mask = boolean_mask.reshape(lhs_shape)
  masked_mean = np.ma.array(last_hidden_state, mask=boolean_mask).mean(axis=1)
  batch["hidden_state"] = masked_mean.data
  return batch

and then extract the hidden states from your tokenized dataset, e.g. imdb_enc:

imdb_enc = imdb_enc.map(forward_pass, batched=True, batch_size=16)
X_train = np.array(imdb_enc["train"]["hidden_state"])
X_test = np.array(imdb_enc["test"]["hidden_state"])
y_train = np.array(imdb_enc["train"]["label"])
y_test = np.array(imdb_enc["test"]["label"])

and then train a classifier:

from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression(n_jobs=-1, penalty="none")
lr_clf.fit(X_train, y_train)
lr_clf.score(X_test, y_test)

If the result is still bad then you might have to try something more elaborate like averaging over certain layers as was done in the BERT paper.
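
For completeness, here is a rough sketch of that layer-averaging idea (it reuses the model, input_ids and attention_mask from the snippets above; taking the last four layers just mirrors the feature-based experiments in the BERT paper):

with torch.no_grad():
    outputs = model(input_ids, attention_mask, output_hidden_states=True, return_dict=True)
# For DistilBERT, outputs.hidden_states is a tuple of 7 tensors: the embedding
# output plus one per transformer layer, each of shape (batch, seq_len, 768)
last_four = torch.stack(outputs.hidden_states[-4:])  # (4, batch, seq_len, 768)
layer_avg = last_four.mean(dim=0)                    # (batch, seq_len, 768)
# then apply the same masked averaging over tokens as before, with layer_avg
# in place of last_hidden_state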

HTH!


Hi lewtun,

Thank you so much for your answer! It’s very kind of you to take the time to write a detailed answer. I highly appreciate your help.

I’m working now, but I’ll look at your answer in detail this evening when I’m done with my job.

Thanks again

Hi @abercher, curiosity got the better of me, so I ran the above suggestion on IMDB and got an accuracy of 86.6% on the test set :slight_smile:

You can find the hacky code here and it shouldn’t be too hard to adapt to your MLP classifier.


Hi again @lewtun,

Thanks for your second answer. I’m looking forward to trying it. Probably this weekend because today was too tight for me to do it.

I looked at the code concerning this swapping of dimensions in

        hidden = self.DistilBERT(ids, attention_mask=mask, return_dict=False)
        hidden = hidden[0]
        hidden = hidden.permute(0, 2, 1)
        hidden = self.maxpool(hidden)

and the reason is that the DistilBERT layer outputs a tensor of shape (n_batch, input_sequence_length, 768), where 768 is the size of the individual token embeddings produced by DistilBERT. Since nn.MaxPool1d takes the max over the last dimension, I swapped the dimensions so that the max pooling is done along the “temporal” axis, i.e., over the embeddings obtained for the different input tokens.
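
To double-check that reasoning, here is a tiny shape sketch (the sizes are made up):

import torch

# Dummy tensor standing in for the DistilBERT output: (batch, seq_length, hidden_size)
batch, seq_length, hidden_size = 2, 10, 768
hidden = torch.randn(batch, seq_length, hidden_size)

# permute to (batch, hidden_size, seq_length) so MaxPool1d pools over the token axis
pooled = torch.nn.MaxPool1d(seq_length)(hidden.permute(0, 2, 1))
print(pooled.shape)  # torch.Size([2, 768, 1])

# Equivalent and arguably simpler: take the max directly along the token dimension
pooled_alt = hidden.max(dim=1).values
print(torch.allclose(pooled.squeeze(-1), pooled_alt))  # True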

Concerning this averaging of unmasked hidden states, I’m a bit confused. I know what masking is, at least the general idea, but I don’t understand why we wouldn’t want the embeddings of the (randomly) masked tokens included in our average.

Also, is averaging a more common practice than taking the max for these kinds of models (the transformer-based ones)?

Anyway, thanks a lot for your support. It was really great. I’ll try all this and come back to you. Sorry for the delay.
