Fine-tune BERT and CamemBERT for a regression problem

I am fine-tuning BERT on sentence ratings given on a scale of 1 to 9, but rather than measuring its accuracy at classifying into the same score/category/bin as the judges, I just want BERT’s score on a continuous scale, like 1, 1.1, 1.2, … up to 9. I also need to figure out how to do this with CamemBERT. What changes need to be made in the BertForSequenceClassification and CamembertForSequenceClassification modules, and what changes need to be made in preprocessing (like encode_plus)?

Hi @sundaravel, you can check the source code for BertForSequenceClassification here. It also has code for the regression case.

Specifically, for regression your last layer will have shape (hidden_size, 1), and you use MSE loss instead of cross-entropy.
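For illustration, here is a minimal sketch of that head (names like BertRegressor are my own, not library classes): a single linear layer on top of the pooled output, trained with nn.MSELoss. CamemBERT works the same way; swap in camembert-base with CamembertModel and CamembertTokenizer and nothing else changes:

import torch
from torch import nn
from transformers import BertModel, BertTokenizer

# A single linear layer of shape (hidden_size, 1) on top of the pooled
# output, trained with MSE loss instead of cross-entropy.
class BertRegressor(nn.Module):
    def __init__(self, model_name='bert-base-cased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        self.regressor = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.dropout(outputs[1])           # pooled [CLS] representation
        preds = self.regressor(pooled).squeeze(-1)  # one continuous score per example
        loss = None
        if labels is not None:
            loss = nn.MSELoss()(preds, labels.float())
        return loss, preds

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertRegressor()
enc = tokenizer.encode_plus('A sentence rated by the judges.', return_tensors='pt')
loss, preds = model(enc['input_ids'], enc['attention_mask'],
                    labels=torch.tensor([7.2]))  # continuous target on the 1-9 scale

Preprocessing with encode_plus stays exactly the same as for classification; only the labels change from class indices to floats.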

1 Like

Hi @valhalla,

I just created my account on this forum, and I can’t see the link you posted in your comment. It seems broken. Do you have another link to the source code?

Kind regards

Aah, yes. The directory structure changed, so the link no longer works. You can now find the class here

Thank you very much!

I’m also trying to do regression using BERT and am getting an error about Longs. Not sure what I’m doing wrong here.

import torch
from transformers import BertForSequenceClassification, AdamW, BertConfig

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

config = BertConfig()
config.num_labels = 1

model = BertForSequenceClassification(config).from_pretrained('bert-base-cased')
model.to(device)
model.train()

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in dataloader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        print(input_ids)
        attention_mask = batch['attention_mask'].to(device)
        print(attention_mask)
        token_type_ids = batch['token_type_ids'].to(device)
        print(token_type_ids)
        labels = batch['labels'].to(device)
        print(labels)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, 
                        token_type_ids=token_type_ids, labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()

Here’s the error (I printed out some data as well):

tensor([[ 101, 1198, 4841,  ...,    0,    0,    0],
        [ 101, 2960, 2254,  ...,    0,    0,    0],
        [ 101, 2866,  182,  ...,    0,    0,    0],
        ...,
        [ 101,  178,  112,  ...,    0,    0,    0],
        [ 101, 9294, 1128,  ...,    0,    0,    0],
        [ 101, 1268, 1185,  ...,    0,    0,    0]], device='cuda:0')
tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')
tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0')
tensor([0.2135, 0.9005, 0.4206, 0.2755, 0.5373, 0.5537, 0.2492, 0.4841, 0.8241,
        0.3545, 0.2813, 0.5674, 0.4098, 0.5857, 0.9476, 0.6094, 0.2778, 0.2974,
        0.3362, 0.3490, 0.9035, 0.7904, 0.4856, 0.1117, 0.3851, 0.7932, 0.9066,
        0.3630, 0.2709, 0.8578, 0.2255, 0.3292], device='cuda:0')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-66-2f7282efa3c9> in <module>()
     24         print(labels)
     25         outputs = model(input_ids=input_ids, attention_mask=attention_mask, 
---> 26                         token_type_ids=token_type_ids, labels=labels)
     27         loss = outputs[0]
     28         loss.backward()

5 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   2262                          .format(input.size(0), target.size(0)))
   2263     if dim == 2:
-> 2264         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   2265     elif dim == 4:
   2266         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #2 'target' in call to _thnn_nll_loss_forward

Any help is appreciated!

1 Like

[I’m guessing here]

‘target’ is probably ‘labels’. What type are your labels initially?

You could try specifying their type explicitly, using something like

tb_labels = batch[2].to(device, dtype=torch.float)

Try this link: python - RuntimeError: expected scalar type Long but found Float - Stack Overflow

1 Like

The targets are floats. I’ll try this, thanks for the suggestion!

@rgwatwormhill Shoot, that didn’t work unfortunately.

r"""
    labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
        Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
        config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
        If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
    """

I have a couple of questions. The first is that it seems odd that torch.LongTensor would make sense for a regression problem; I’m not sure how the two are compatible. The second is: would I need to add an additional layer to BERT to do regression, or remove the logits layer and replace it?

1 Like

Looks like the model is initialized incorrectly. For regression we need num_labels=1, and you can do that in two ways:

config = BertConfig.from_pretrained("...", num_labels=1)
model = BertForSequenceClassification.from_pretrained("...", config=config)

or

model = BertForSequenceClassification.from_pretrained("...", num_labels=1)

Creating the model from a config and then calling from_pretrained again will override the config params. So in your code the model still has num_labels=2.

The first is that it seems odd that torch.LongTensor would make sense for a regression problem

Yes, the docstring should be corrected. But you can still pass a float tensor for a regression problem.
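A quick way to convince yourself (a sketch, with toy token ids): check model.config.num_labels after loading, then pass a float label through the forward pass:

import torch
from transformers import BertForSequenceClassification

# Sanity check: with num_labels=1 the regression branch is used,
# and float labels produce an MSE loss.
model = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=1)
print(model.config.num_labels)   # 1

input_ids = torch.tensor([[101, 7592, 102]])   # toy token ids: [CLS] ... [SEP]
labels = torch.tensor([0.42])                  # float target, not Long
outputs = model(input_ids=input_ids, labels=labels)
print(outputs[0])                              # scalar MSE loss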

2 Likes

@valhalla Thank you so much! Such a small detail haha. Got it working.

Well, I got it “working” in that there are no errors now, but surprisingly (to me) the validation loss increases with every epoch. I’m not sure what I’m doing wrong.

import torch
import tqdm.notebook
from transformers import DistilBertForSequenceClassification, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Device used for training: {device}')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=1)
model.to(device)
model.train()

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(10):
    for batch in tqdm.notebook.tqdm(train_dataloader):
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        # token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, 
                        labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()
    validation_loss = validate(test_dataset, model)
    print(f"Validation loss in epoch {epoch}: {validation_loss}")

Here is my validate function:

def validate(test_dataset, model):
    torch.cuda.empty_cache()

    test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=128)
    total_loss = 0
    with torch.no_grad():
        for batch in tqdm.notebook.tqdm(test_dataloader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            # token_type_ids = batch['token_type_ids'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                            labels=labels)
            total_loss += outputs[0].item()
    return total_loss

Here is some output showing the validation loss increasing:

Device used for training: cuda
Validation loss in epoch 0: 70428.39735794067
Validation loss in epoch 1: 69269.67090129852
Validation loss in epoch 2: 70188.32639312744
Validation loss in epoch 3: 72369.78367424011
Validation loss in epoch 4: 73700.79190158844
Validation loss in epoch 5: 74986.73181152344

Is the training loss also increasing?
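Also, two things worth checking in your validate function: it never calls model.eval(), so dropout is still active during evaluation, and it returns the sum of per-batch losses rather than the mean, which makes the absolute numbers hard to compare. A sketch of both fixes, assuming the same device and batch layout as your code above:

def validate(test_dataset, model):
    model.eval()   # disable dropout for evaluation
    test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=128)
    total_loss = 0.0
    n_batches = 0
    with torch.no_grad():
        for batch in test_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                            labels=labels)
            total_loss += outputs[0].item()
            n_batches += 1
    model.train()  # restore training mode for the next epoch
    return total_loss / n_batches   # mean per-batch loss instead of a running sum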

1 Like

@valhalla can you post a code snippet showing how to set the number of output labels?

@thecity2 can you show how you modified the loss function to something suitable for regression?

I have a quick question: for regression problems, do we need to scale the output, for example to a range like [0, 10]?

Hi @Mahsaseifikar,

No (I don’t think so).

I am not an expert, and it has been a year since I last looked at any BERT code, but I don’t remember ever having to think about what scale to use for the output in my regression problem.