Continue pre-training BERT

Hello, I have a small portion of labeled data and a much larger set of unlabeled observations. I want to use the unlabeled samples to continue the pre-training of BERT, and then build a classifier on top of it.

Following this post

I tried to use BertModel.from_pretrained('bert-base-uncased'), and specifically:

    import torch
    from torch.optim import AdamW
    from transformers import BertModel, get_linear_schedule_with_warmup

    HF_BERT_MODEL = 'bert-base-uncased'
    device = torch.device('cuda')

    model = BertModel.from_pretrained(HF_BERT_MODEL)
    model.cuda()

    optimizer = AdamW(model.parameters(),
                      lr=2e-5,
                      eps=1e-8)

    # Linear learning-rate decay; train_dataloader is built earlier from the tokenized inputs
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0,
                                                num_training_steps=len(train_dataloader))

    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model.zero_grad()

        result = model(b_input_ids,
                       token_type_ids=None,
                       attention_mask=b_input_mask,
                       return_dict=True)

        loss = result.loss
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

>>>'BaseModelOutputWithPoolingAndCrossAttentions' object has no attribute 'loss'

I get the above error.
My question is: how do I do this fine-tuning?

Loss is available only when labels are provided to the model.
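
For example, a minimal sketch assuming a task-specific head such as BertForSequenceClassification (the base BertModel has no head and never returns a loss):

    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

    # Without labels, the output's loss field stays None
    outputs = model(b_input_ids, attention_mask=b_input_mask)
    print(outputs.loss)          # None

    # With labels, the model computes the classification loss itself
    outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
    print(outputs.loss)          # a scalar tensor you can call .backward() on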

Thanks for your answer. The question is, though, how can we train the embeddings (using gradient descent) if there is no loss?

I don’t understand the question. Don’t you need to provide labels in order to calculate a loss? You either provide labels to the model so it can calculate the loss, or you calculate the loss yourself from the model’s output.

So in your code above, for training I would expect you to pass the labels to the model, then get the loss as you are expecting, and then call loss.backward().
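
Concretely, the batch step from the code above could look something like this (a sketch, assuming a model class that accepts labels, e.g. BertForSequenceClassification):

    result = model(b_input_ids,
                   token_type_ids=None,
                   attention_mask=b_input_mask,
                   labels=b_labels,       # labels let the model compute the loss
                   return_dict=True)

    loss = result.loss
    loss.backward()                       # backpropagate before clipping and stepping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()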

The idea was to continue the pre-training, which, according to the BERT paper, means masking tokens (around 15% of the provided text) and trying to predict the masked tokens. To the best of my knowledge, this process is considered “self-supervised”, and therefore you don’t explicitly provide labels; instead they are inferred from the data. In this case you still have a loss (otherwise how could you learn the embeddings?).
From your answer, though, I understand that I might need to mask tokens myself and add the masked tokens as the labels. Am I correct? Could you refer me to a notebook that shows an example?
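
For reference, a rough sketch of how that masking can be done automatically with DataCollatorForLanguageModeling and BertForMaskedLM (here `texts` is assumed to be your list of unlabeled documents):

    from transformers import BertTokenizerFast, BertForMaskedLM, DataCollatorForLanguageModeling

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
    model = BertForMaskedLM.from_pretrained('bert-base-uncased')

    # Randomly masks ~15% of tokens and puts the original token ids in `labels`
    # (non-masked positions are set to -100 and ignored by the loss)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                    mlm=True,
                                                    mlm_probability=0.15)

    encodings = tokenizer(texts, truncation=True)
    examples = [{'input_ids': ids, 'attention_mask': mask}
                for ids, mask in zip(encodings['input_ids'], encodings['attention_mask'])]

    batch = data_collator(examples)       # pads and masks; returns input_ids, attention_mask, labels

    outputs = model(input_ids=batch['input_ids'],
                    attention_mask=batch['attention_mask'],
                    labels=batch['labels'])
    loss = outputs.loss                   # MLM loss computed over the masked tokens only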

Or you can use the Trainer class from Hugging Face, as described in their guide.
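
A rough sketch of that route, assuming a tokenized `datasets.Dataset` and the data collator from the previous sketch (the output directory name is just a placeholder):

    from transformers import BertForMaskedLM, Trainer, TrainingArguments

    model = BertForMaskedLM.from_pretrained('bert-base-uncased')

    training_args = TrainingArguments(
        output_dir='./bert-continued-pretraining',   # placeholder path
        num_train_epochs=1,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,   # a datasets.Dataset with input_ids / attention_mask
        data_collator=data_collator,       # DataCollatorForLanguageModeling from above
    )

    trainer.train()
    trainer.save_model('./bert-continued-pretraining')   # reload later when building the classifier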
