Train loss is decreasing, but accuracy remains the same

This is the training and development cell for a multi-label classification task using RoBERTa (BERT). The first part is training and the second part is development (validation). train_dataloader is my training dataset and dev_dataloader is my development dataset. My question is: why does the train loss decrease step by step while the accuracy barely increases? In practice, accuracy increases until iteration 4, but the train loss keeps decreasing until the last epoch (iteration). Is this OK, or is there a problem?

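# Assumes model, optimizer, device, num_labels, train_dataloader and
# dev_dataloader are defined in earlier cells.
import torch
from torch.nn import BCEWithLogitsLoss
from sklearn.metrics import jaccard_score
from tqdm import trange

train_loss_set = []
iterate = 4
loss_function = BCEWithLogitsLoss()  # multi-label loss: one sigmoid per label
for _ in trange(iterate, desc="Iterate"):
  model.train()

  train_loss = 0
  nu_train_examples, nu_train_steps = 0, 0

  for step, batch in enumerate(train_dataloader):
    batch = tuple(t.to(device) for t in batch)
    batch_input_ids, batch_input_mask, batch_labels = batch
    optimizer.zero_grad()
    output = model(batch_input_ids, attention_mask=batch_input_mask)
    logits = output[0]
    loss = loss_function(logits.view(-1,num_labels),batch_labels.type_as(logits).view(-1,num_labels))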
    train_loss_set.append(loss.item())    
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    nu_train_examples += batch_input_ids.size(0)
    nu_train_steps += 1

  print("Train loss: {}".format(train_loss/nu_train_steps))

###############################################################################

  model.eval()
  logits_pred,true_labels,pred_labels,tokenized_texts = [],[],[],[]

  # Predict
  for i, batch in enumerate(dev_dataloader):
    batch = tuple(t.to(device) for t in batch)
    batch_input_ids, batch_input_mask, batch_labels = batch
    with torch.no_grad():
      out = model(batch_input_ids, attention_mask=batch_input_mask)
      batch_logit_pred = out[0]
      pred_label = torch.sigmoid(batch_logit_pred)
      batch_logit_pred = batch_logit_pred.detach().cpu().numpy()
      pred_label = pred_label.to('cpu').numpy()
      batch_labels = batch_labels.to('cpu').numpy()

    tokenized_texts.append(batch_input_ids)
    logits_pred.append(batch_logit_pred)
    true_labels.append(batch_labels)
    pred_labels.append(pred_label)

  pred_labels = [item for sublist in pred_labels for item in sublist]
  true_labels = [item for sublist in true_labels for item in sublist]
  threshold = 0.4
  pred_bools = [pl>threshold for pl in pred_labels]
  true_bools = [tl==1 for tl in true_labels]
  
  print("Accuracy is: ", jaccard_score(true_bools,pred_bools,average='samples'))
torch.save(model.state_dict(), 'bert_model')

And the outputs:

Iterate:   0%|          | 0/10 [00:00<?, ?it/s]

Train loss: 0.4024542534684801

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Jaccard is ill-defined and being set to 0.0 in samples with no true or predicted labels. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Accuracy is:  0.5806403013182674

Iterate:  10%|█         | 1/10 [03:21<30:14, 201.64s/it]

Train loss: 0.2972540049911379
Accuracy is:  0.6091337099811676

Iterate:  20%|██        | 2/10 [06:49<27:07, 203.49s/it]

Train loss: 0.26178574864264137
Accuracy is:  0.608361581920904

Iterate:  30%|███       | 3/10 [10:17<23:53, 204.78s/it]

Train loss: 0.23612180122962365
Accuracy is:  0.6096717783158462

Iterate:  40%|████      | 4/10 [13:44<20:33, 205.66s/it]

Train loss: 0.21416303515434265
Accuracy is:  0.6046892655367231

Iterate:  50%|█████     | 5/10 [17:12<17:11, 206.27s/it]

Train loss: 0.1929110718982203
Accuracy is:  0.6030885122410546

Iterate:  60%|██████    | 6/10 [20:40<13:46, 206.74s/it]

Train loss: 0.17280191068465894
Accuracy is:  0.6003766478342749

Iterate:  70%|███████   | 7/10 [24:08<10:21, 207.04s/it]

Train loss: 0.1517329115446631
Accuracy is:  0.5864783427495291

Iterate:  80%|████████  | 8/10 [27:35<06:54, 207.23s/it]

Train loss: 0.12957811209705325
Accuracy is:  0.5818832391713747

Iterate:  90%|█████████ | 9/10 [31:03<03:27, 207.39s/it]

Train loss: 0.11256680189521162
Accuracy is:  0.5796045197740114

Iterate: 100%|██████████| 10/10 [34:31<00:00, 207.14s/it]
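
A note on the metric: the "Accuracy" printed in the log is a sample-averaged Jaccard score, not plain accuracy. The sketch below is a pure-Python illustration of what that score measures; it is not the scikit-learn implementation, and the helper name samples_jaccard and the toy rows are made up for the example. A sample with no true and no predicted labels has an empty union, which is the case the UndefinedMetricWarning in the log refers to, and it is counted as 0.0 here to match the behaviour the warning describes.

```python
def samples_jaccard(true_rows, pred_rows):
    """Average, over samples, of |true AND pred| / |true OR pred|."""
    scores = []
    for true_row, pred_row in zip(true_rows, pred_rows):
        true_set = {i for i, v in enumerate(true_row) if v}
        pred_set = {i for i, v in enumerate(pred_row) if v}
        union = true_set | pred_set
        # Empty union: no true and no predicted labels, so the ratio is
        # undefined; count it as 0.0, as the warning in the log describes.
        scores.append(len(true_set & pred_set) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): 3 samples, 3 labels each.
toy_true = [[1, 0, 1], [0, 1, 0], [0, 0, 0]]
toy_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
toy_score = samples_jaccard(toy_true, toy_pred)
print(toy_score)
```

Because the third toy sample has no labels at all, it pulls the average down even though the prediction is "right", so dev examples with no positive labels can cap this score regardless of how well the model trains.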

This means you are overfitting (the training loss keeps decreasing, but the validation loss/accuracy does not improve), so you should try any technique that helps reduce overfitting: weight decay, more dropout, data augmentation (if applicable)…
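
To make the pattern concrete, here is a small sketch that reads the numbers from the log in the question (rounded to four decimals) and finds where the dev score peaks. The patience-based stopping rule is an extra option that is not suggested in this thread, and the helper names and patience=2 are illustrative assumptions.

```python
# Values copied from the log in the question, rounded to 4 decimals.
train_loss = [0.4025, 0.2973, 0.2618, 0.2361, 0.2142,
              0.1929, 0.1728, 0.1517, 0.1296, 0.1126]
dev_score = [0.5806, 0.6091, 0.6084, 0.6097, 0.6047,
             0.6031, 0.6004, 0.5865, 0.5819, 0.5796]


def best_epoch(scores):
    """1-based index of the epoch with the highest dev score."""
    return max(range(len(scores)), key=lambda i: scores[i]) + 1


def early_stop_epoch(scores, patience=2):
    """1-based epoch at which training would stop if we quit after
    `patience` consecutive epochs without a new best dev score."""
    best, best_i = float("-inf"), 0
    for i, score in enumerate(scores):
        if score > best:
            best, best_i = score, i
        elif i - best_i >= patience:
            return i + 1
    return len(scores)


# The train loss falls every epoch while the dev score peaks early.
loss_always_falls = all(a > b for a, b in zip(train_loss, train_loss[1:]))
peak = best_epoch(dev_score)
stop = early_stop_epoch(dev_score, patience=2)
print(loss_always_falls, peak, stop)
```

In a real loop you would save the model whenever the dev score improves, instead of only once after the final epoch, so that the checkpoint you keep is the one from the peak.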

@sgugger - Following up on your comment, which techniques do you recommend while fine-tuning a BERT model for a sequence classification task? How can we, for instance, adjust weight decay and dropout in the architecture?

You can adjust the weight_decay in your TrainingArguments. For the dropout in the model, you can adjust it by passing it to your XxxModel.from_pretrained call. The exact name of the argument varies depending on the model, so you should check the documentation of the config of the model you are using.
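
If you train with a manual PyTorch loop like the one in the question rather than with Trainer, weight decay is typically set on the optimizer instead. A minimal sketch, assuming torch.optim.AdamW: the helper name build_optimizer, the toy model, and the lr and weight_decay values are placeholders for illustration, not settings recommended in this thread. Leaving 1-D parameters (biases, LayerNorm weights) undecayed is a common convention, not a requirement.

```python
import torch
from torch import nn


def build_optimizer(model, lr=2e-5, weight_decay=0.01):
    """AdamW with weight decay applied only to parameters with 2+ dimensions."""
    decay, no_decay = [], []
    for _, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Biases and LayerNorm weights are 1-D; they go in the undecayed group.
        (no_decay if param.ndim < 2 else decay).append(param)
    return torch.optim.AdamW(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=lr,
    )


# Toy stand-in for the real model, only to show the grouping.
toy_model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 2))
optimizer = build_optimizer(toy_model)
```

With the real model you would call build_optimizer(model) once before the training loop and keep the rest of the loop unchanged.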

Thanks for your explanation.

For those interested in testing BERT models with adjusted dropout settings, the following hidden_dropout_prob addition worked for me.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, hidden_dropout_prob=0.2)