Hello everyone,
I’m trying to do some sentiment analysis on the IMDB movie reviews dataset. First I trained a model based on GloVe embeddings followed by an LSTM layer and a fully connected feedforward layer. I implemented it in PyTorch and it works like a charm. Now I’m trying to replace the GloVe + LSTM part with a transformer-based model. I managed to do it, and I chose DistilBERT since it is supposed to be lightweight (I’m training the model on my laptop, which has no GPU). I kept the DistilBERT weights frozen, again to minimize the computational cost. But the results are really bad: the model basically oscillates around random guessing.
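In case it matters, this is roughly how the reviews are tokenized before going into the model (a simplified sketch, not my exact preprocessing; max_length=256 is just an illustrative value):

```python
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Every review is padded/truncated to the same fixed length, because the model
# below applies MaxPool1d(seq_length) over the token dimension.
encoding = tokenizer(
    "This movie was surprisingly good.",
    padding="max_length",
    truncation=True,
    max_length=256,  # the seq_length I pass to the classifier
    return_tensors="pt",
)
ids = encoding["input_ids"]        # shape (1, 256)
mask = encoding["attention_mask"]  # shape (1, 256)
```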
Here are a few metrics after 1 epoch:
Accuracy = 0.5201
F1 score binary = 0.6723
Recall score = 0.9898
Average precision score = 0.5090
Area under the ROC = 0.7421
Loss on validation set: 1040.9420166015625
Here are the same metrics after 5 epochs:
Accuracy = 0.6855
F1 score binary = 0.6725
Recall score = 0.6493
Average precision score = 0.6974
Area under the ROC = 0.7497
Loss on validation set: 480.90106201171875
and here are the same metrics after 11 epochs (after which I stopped):
Accuracy = 0.5939
F1 score binary = 0.3630
Recall score = 0.2327
Average precision score = 0.8251
Area under the ROC = 0.7330
Loss on validation set: 738.0289306640625
So, not the magical bump in accuracy I was hoping for.
Here is also the model I used:
import torch.nn as nn
from transformers import DistilBertModel


class DistilBERTWithPoolingClf(nn.Module):
    """
    Classifier based on the Hugging Face Transformers implementation of DistilBERT,
    using a basic DistilBERT layer with max pooling on top of it.
    """

    __name__ = "DistilBERTbase"

    def __init__(self, keep_prob, seq_length):
        super(DistilBERTWithPoolingClf, self).__init__()
        self.DistilBERT = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.DistilBERT.requires_grad_(False)  # DistilBERT weights are frozen
        self.maxpool = nn.MaxPool1d(seq_length)
        self.dropout = nn.Dropout(1 - keep_prob)
        self.hidden2bin = nn.Linear(768, 2)

    def forward(self, ids, mask, token_type_ids):
        batch_size = ids.shape[0]
        # Unlike BERT (Hugging Face implementation), the forward method returns only
        # the embeddings of the input tokens; there is no pooled CLS embedding
        # (as far as I know)
        hidden = self.DistilBERT(ids, attention_mask=mask, return_dict=False)
        hidden = hidden[0]                # (batch, seq_length, 768)
        hidden = hidden.permute(0, 2, 1)  # (batch, 768, seq_length)
        hidden = self.maxpool(hidden)     # (batch, 768, 1)
        hidden = self.dropout(hidden)
        logits = self.hidden2bin(hidden.view(batch_size, 768))
        return logits
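And this is, schematically, how I train it (a minimal sketch under the assumption that train_loader yields (ids, mask, labels) batches; the hyperparameters are illustrative and the real loop also computes the metrics above):

```python
import torch
import torch.nn as nn

model = DistilBERTWithPoolingClf(keep_prob=0.8, seq_length=256)

# Only parameters that still require grad are handed to the optimizer,
# i.e. in practice just the final hidden2bin layer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
criterion = nn.CrossEntropyLoss()
n_epochs = 11

model.train()
for epoch in range(n_epochs):
    for ids, mask, labels in train_loader:  # labels are 0 (negative) / 1 (positive)
        optimizer.zero_grad()
        logits = model(ids, mask, token_type_ids=None)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```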
I have a few questions:
- Am I being naive in trying to use these models without a GPU? I was hoping that, by freezing DistilBERT, I would only have to train the feedforward layer and that this would keep training tractable (see the quick parameter count at the end of the post).
- Could it be that the results are that bad because the DistilBERT layer is frozen?
- Do you see some obvious mistake in my definition of the model? I am a complete beginner when it comes to Hugging Face Transformers, so I wouldn’t be surprised if there were one.
- Any other suggestion?
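For reference, here is the quick check I use to confirm that only the final layer is trainable after freezing (the keep_prob and seq_length values are just examples):

```python
model = DistilBERTWithPoolingClf(keep_prob=0.8, seq_length=256)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")
# Only hidden2bin should be trainable (768 * 2 + 2 = 1,538 parameters),
# while the ~66M DistilBERT parameters stay frozen.
```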