DistilBERT and CLS token


I’m completely new to Huggingface Transformers. I apologize if my questions are already answered somewhere else, and if it’s the case, I would be glad if you could point me to the given documentation.

I’m trying to progress in NLP by training and testing different models for sentiment analysis on the IMDB Movie Reviews data set, so I implemented some custom subclasses of nn.Module. In the first one I used a BERT-base layer and took the embedding of the CLS token as the sentence embedding to classify it. The model was:

    import torch.nn as nn
    import transformers


    class BERTBaseClassifier(nn.Module):
        """
        Frozen BERT-base encoder; its pooled [CLS] output goes through dropout
        and a linear layer that produces two class logits.
        """
        __name__ = "BERTbase"

        def __init__(self, keep_prob):
            super(BERTBaseClassifier, self).__init__()
            self.BERT = transformers.BertModel.from_pretrained("bert-base-uncased")
            self.BERT.requires_grad_(False)  # BERT weights are frozen
            self.dropout = nn.Dropout(1 - keep_prob)
            self.hidden2bin = nn.Linear(768, 2)  # 768 = BERT-base hidden size

        def forward(self, ids, mask, token_type_ids):
            batch_size = ids.shape[0]
            # Index 1 of the output is the pooled [CLS] vector; integer indexing
            # works whether the model returns a tuple or a ModelOutput.
            hidden = self.BERT(ids, attention_mask=mask, token_type_ids=token_type_ids)[1]
            hidden = self.dropout(hidden)
            logits = self.hidden2bin(hidden.view(batch_size, 768))
            return logits

It was not working super well (I have to admit that I’m running it on the CPU of my laptop so it was taking around 8 hours for an epoch) and I thought it could be good to use the lighter version of the model: DistilBERT. More precisely, this one:


But I was a bit surprised to see that its output is different from the output of BERT. Unless I missed it, what the forward method of DistilBERT outputs is the (final) embeddings of all the input tokens. I use input sequences of length 300 (obtained by padding the sentences) and the output also has length 300. So I guess that there is no additional embedding for a CLS token.
I have three questions:

  1. Am I correct? Is there no way to get an embedding for this CLS token with this model?
  2. If yes, why is that so?
  3. Since what I’m looking for is an embedding of the sentence, am I correct to believe that the closest thing to a replacement in my model above of the BERTBase layer with something based on DistilBERT would be this sentence-transformers model: distilbert-base-nli-stsb-mean-tokens ?

Thank you for your help
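
For context on question 3, the "mean-tokens" part of that model name refers to mean pooling: the sentence embedding is the average of the token embeddings, with padding positions excluded through the attention mask. A minimal sketch of that pooling step in plain PyTorch follows; the function name mean_pool and the toy tensors are illustrative and not taken from sentence-transformers.

```python
import torch


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim) float tensor
    attention_mask:   (batch, seq_len) tensor of 0/1
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # avoid division by zero
    return summed / counts


# Toy input: one sentence, three positions, the last one is padding.
emb = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
sentence_vec = mean_pool(emb, mask)
print(sentence_vec)  # tensor([[2., 3.]])
```

The padded position (100, 100) does not affect the result, which is why the mask matters when every review is padded to the same length.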

Hi abercher,

it’s a few months since I used DistilBERT, but I’m sure I used a CLS token from it.

When you run the tokenizer, have you set add_special_tokens=True?
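
A minimal sketch of what this could look like, assuming the tokenizer has placed [CLS] at position 0 (which is what add_special_tokens=True does for a single input sequence). DistilBertModel has no pooler, so it does not return a pooled second output the way BertModel does, but the first position of its last hidden state is the [CLS] embedding and can feed the same kind of linear head. The class name DistilBERTClassifier, the freeze flag, and the tiny randomly initialised config are assumptions made for illustration; the config is there only so the snippet runs without downloading weights.

```python
import torch
import torch.nn as nn
from transformers import DistilBertConfig, DistilBertModel


class DistilBERTClassifier(nn.Module):
    """Linear classifier on the hidden state at position 0 (the [CLS] slot)."""

    def __init__(self, encoder, keep_prob, freeze=True):
        super().__init__()
        self.encoder = encoder
        if freeze:
            self.encoder.requires_grad_(False)  # keep the encoder weights fixed
        self.dropout = nn.Dropout(1 - keep_prob)
        self.hidden2bin = nn.Linear(encoder.config.dim, 2)

    def forward(self, ids, mask):
        # DistilBERT takes no token_type_ids. Index 0 of its output is the
        # last hidden state with shape (batch, seq_len, dim).
        last_hidden = self.encoder(input_ids=ids, attention_mask=mask)[0]
        cls = last_hidden[:, 0]  # embedding at the first position
        return self.hidden2bin(self.dropout(cls))


# Tiny random encoder for demonstration only. With real weights one would use
# DistilBertModel.from_pretrained("distilbert-base-uncased") instead.
config = DistilBertConfig(
    vocab_size=100, dim=32, n_layers=1, n_heads=2,
    hidden_dim=64, max_position_embeddings=16,
)
model = DistilBERTClassifier(DistilBertModel(config), keep_prob=0.9)
model.eval()

ids = torch.randint(0, 100, (4, 10))         # 4 fake sequences of 10 token ids
mask = torch.ones(4, 10, dtype=torch.long)   # no padding in this toy batch
with torch.no_grad():
    logits = model(ids, mask)
print(logits.shape)
```

The encoder is passed in rather than created inside the class so that the same head can be tried with a pretrained model or with a small random one.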

Hi rgwatwormhill,

Thank you for your answer. I didn’t specify any such parameter. My code using the tokenizer is only the instantiation:

    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

and the usage in the Dataset subclass:

    import torch
    from torch.utils.data import Dataset


    class IMDBDistilBertDataset(Dataset):
        def __init__(self, cleaned_reviews, y_binary, tokenizer, max_len):
            self.tokenizer = tokenizer
            self.cleaned_reviews = cleaned_reviews
            self.y_binary = y_binary
            self.max_len = max_len

        def __len__(self):
            return len(self.y_binary)

        def __getitem__(self, index):
            text = str(self.cleaned_reviews[index])
            text = " ".join(text.split())  # collapse runs of whitespace
            inputs = self.tokenizer.encode_plus(
                # (the arguments of this call are missing from the post)
            )
            ids = inputs['input_ids']
            mask = inputs['attention_mask']
            # Only present if the tokenizer call asks for token type ids
            token_type_ids = inputs["token_type_ids"]

            return {
                'ids': torch.tensor(ids, dtype=torch.long),
                'mask': torch.tensor(mask, dtype=torch.long),
                'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
                'targets': self.y_binary[index]
            }

I looked at the code, and if I’m not wrong, this parameter is set to True by default. But to be honest, I didn’t know that the parameter existed before you mentioned it. I guess I should look for documentation.

Thanks again