I’m completely new to Huggingface Transformers. I apologize if my questions are already answered somewhere else, and if it’s the case, I would be glad if you could point me to the given documentation.
I’m trying to progress in NLP by training and testing different models for sentiment analysis on the IMDB Movie Reviews data set, so I implemented some custom subclasses of nn.Module. In the first one I used a BERT-base layer and took the embedding of the CLS token as the embedding of the sentence to classify. The model was:
    import torch.nn as nn
    import transformers


    class BERTBaseClassifier(nn.Module):
        """
        Frozen BERT-base encoder; the pooled [CLS] output goes through
        dropout and a linear layer to produce two-class logits.
        """
        __name__ = "BERTbase"

        def __init__(self, keep_prob):
            super(BERTBaseClassifier, self).__init__()
            self.BERT = transformers.BertModel.from_pretrained("bert-base-uncased")
            self.BERT.requires_grad_(False)  # BERT weights are frozen
            self.dropout = nn.Dropout(1 - keep_prob)
            self.hidden2bin = nn.Linear(768, 2)  # pooled output (768) -> 2 classes

        def forward(self, ids, mask, token_type_ids):
            batch_size = ids.shape[0]
            # Index 1 is the pooled output: the [CLS] hidden state passed through
            # BERT's pooler (linear + tanh). Integer indexing works whether the
            # model returns a plain tuple or a ModelOutput object.
            hidden = self.BERT(ids, attention_mask=mask, token_type_ids=token_type_ids)[1]
            hidden = self.dropout(hidden)
            logits = self.hidden2bin(hidden.view(batch_size, 768))
            return logits
It was not working super well (I have to admit that I’m running it on the CPU of my laptop so it was taking around 8 hours for an epoch) and I thought it could be good to use the lighter version of the model: DistilBERT. More precisely, this one:
But I was a bit surprised to see that its output differs from the output of BERT. Unless I missed something, the forward method of DistilBERT returns only the (final) embeddings of all the input tokens, with no separate pooled output. I use input sequences of length 300 (padding the reviews) and the output also has length 300. So I guess that there is no additional embedding for a CLS token.
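To make the shape check concrete, here is a minimal sketch of what I am looking at. It assumes a tiny, randomly initialised DistilBERT (the config sizes are hypothetical and chosen only so that nothing has to be downloaded), and random token ids stand in for real tokenized reviews.

```python
import torch
from transformers import DistilBertConfig, DistilBertModel

# Hypothetical tiny config with random weights: only the output shapes matter here.
config = DistilBertConfig(
    vocab_size=100,
    max_position_embeddings=300,
    n_layers=1,
    n_heads=2,
    dim=32,
    hidden_dim=64,
)
model = DistilBertModel(config).eval()

batch_size, seq_len = 2, 300
ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))  # stand-in for token ids
mask = torch.ones(batch_size, seq_len, dtype=torch.long)

with torch.no_grad():
    # Element 0 of the output holds the final hidden states of all tokens.
    hidden = model(input_ids=ids, attention_mask=mask)[0]

# One vector per input position: (batch_size, seq_len, dim).
print(hidden.shape)

# If the tokenizer prepends [CLS], its vector would sit at position 0.
first_token = hidden[:, 0]
print(first_token.shape)
```

With the real checkpoint the last dimension would be 768 instead of 32. Whether slicing position 0 like this is a reasonable substitute for BERT's pooled output is exactly what I am unsure about.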
I have three questions:
- Am I correct? Is there no way to get an embedding for this CLS token with this model?
- If yes, why is that so?
- Since what I’m looking for is an embedding of the whole sentence, am I correct to believe that the closest DistilBERT-based replacement for the BERT-base layer in my model above would be this sentence-transformers model: distilbert-base-nli-stsb-mean-tokens ?
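For the third question, this is a minimal sketch of what I understand "mean-tokens" to mean: averaging the token embeddings while ignoring padded positions. The function name masked_mean_pool is my own and hypothetical, and this is plain PyTorch rather than the sentence-transformers implementation.

```python
import torch


def masked_mean_pool(hidden, mask):
    """Average token embeddings over the sequence, skipping padded positions.

    hidden: (batch, seq_len, dim) token embeddings
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = mask.unsqueeze(-1).to(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)          # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)     # avoid division by zero
    return summed / counts                       # (batch, dim)


# Toy example: two real tokens and one padded token that must be ignored.
toy_hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 0]])
sentence_embedding = masked_mean_pool(toy_hidden, toy_mask)
```

In my classifier this would take the place of the pooled CLS vector before the dropout and linear layers.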
Thank you for your help