Continue pre-training BERT

Hello, I have a small portion of labeled data and a much larger set of unlabeled observations. I want to use the unlabeled samples to continue the pre-training of BERT, and then build a classifier on top of it.

Following this post

I tried to use BertModel.from_pretrained('bert-base-uncased'), and specifically:

    import torch
    from torch.optim import AdamW
    from transformers import BertModel, get_linear_schedule_with_warmup

    HF_BERT_MODEL = 'bert-base-uncased'
    device = torch.device('cuda')

    model = BertModel.from_pretrained(HF_BERT_MODEL)
    model.cuda()

    optimizer = AdamW(model.parameters(),
                      lr=2e-5,
                      eps=1e-8)

    # Linear learning-rate decay; train_dataloader is built earlier from the tokenized inputs
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0,
                                                num_training_steps=len(train_dataloader))

    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model.zero_grad()

        result = model(b_input_ids,
                       token_type_ids=None,
                       attention_mask=b_input_mask,
                       return_dict=True)

        loss = result.loss
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

>>>'BaseModelOutputWithPoolingAndCrossAttentions' object has no attribute 'loss'

I get the above error.
My question is: how do I do this fine-tuning?

Loss is available only when labels are provided to the model.
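
For example, a minimal sketch assuming a task-specific head such as BertForSequenceClassification (the base BertModel has no head and never returns a loss):

    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

    # Without labels, the output's loss field stays None
    outputs = model(b_input_ids, attention_mask=b_input_mask)
    print(outputs.loss)          # None

    # With labels, the model computes the classification loss itself
    outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
    print(outputs.loss)          # a scalar tensor you can call .backward() on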

Thanks for your answer. The question is, though, how can we train the embeddings (using gradient descent) if there is no loss?

I don’t understand the question. Don’t you need to provide labels in order to calculate a loss? You either provide labels to the model so it can calculate the loss, or you calculate the loss yourself from the model’s output.

So in your code above, for training I would expect you to pass the labels to the model, then get the loss as you are expecting, and then call loss.backward().
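
Concretely, the batch step from the code above could look something like this (a sketch, assuming a model class that accepts labels, e.g. BertForSequenceClassification):

    result = model(b_input_ids,
                   token_type_ids=None,
                   attention_mask=b_input_mask,
                   labels=b_labels,       # labels let the model compute the loss
                   return_dict=True)

    loss = result.loss
    loss.backward()                       # backpropagate before clipping and stepping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()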

The idea was to continue the pre-training, which, according to the BERT paper, means masking tokens (around 15% of the provided text) and trying to predict the masked tokens. To the best of my knowledge, this process is considered “self-supervised”, and therefore you don’t explicitly provide labels; instead they are inferred from the data. In this case you still have a loss (otherwise how could you learn the embeddings?).
From your answer, though, I understand that I might need to mask tokens myself and add the masked tokens as the labels. Am I correct? Could you refer me to a notebook that shows an example?
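
For reference, a rough sketch of how that masking can be done automatically with DataCollatorForLanguageModeling and BertForMaskedLM (here `texts` is assumed to be your list of unlabeled documents):

    from transformers import BertTokenizerFast, BertForMaskedLM, DataCollatorForLanguageModeling

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
    model = BertForMaskedLM.from_pretrained('bert-base-uncased')

    # Randomly masks ~15% of tokens and puts the original token ids in `labels`
    # (non-masked positions are set to -100 and ignored by the loss)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                    mlm=True,
                                                    mlm_probability=0.15)

    encodings = tokenizer(texts, truncation=True)
    examples = [{'input_ids': ids, 'attention_mask': mask}
                for ids, mask in zip(encodings['input_ids'], encodings['attention_mask'])]

    batch = data_collator(examples)       # pads and masks; returns input_ids, attention_mask, labels

    outputs = model(input_ids=batch['input_ids'],
                    attention_mask=batch['attention_mask'],
                    labels=batch['labels'])
    loss = outputs.loss                   # MLM loss computed over the masked tokens only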

Or you can use the Trainer class from Hugging Face, as described in their guide.
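
A rough sketch of that route, assuming a tokenized `datasets.Dataset` and the data collator from the previous sketch (the output directory name is just a placeholder):

    from transformers import BertForMaskedLM, Trainer, TrainingArguments

    model = BertForMaskedLM.from_pretrained('bert-base-uncased')

    training_args = TrainingArguments(
        output_dir='./bert-continued-pretraining',   # placeholder path
        num_train_epochs=1,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,   # a datasets.Dataset with input_ids / attention_mask
        data_collator=data_collator,       # DataCollatorForLanguageModeling from above
    )

    trainer.train()
    trainer.save_model('./bert-continued-pretraining')   # reload later when building the classifier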
