Despite Low Training Loss, Model Can't Predict Training Set Correctly

Goal: Create a Multi-Label Text Classifier.

Dataset Description: Each example describes properties of a sentence as they relate to specific features. The use case relates to Diversity, Equity, and Inclusion: for example, whether a sentence or paragraph talks about race, gender, bias, etc.

There are about 250 examples, with 15 labels to select from. On average, each example has about 4-6 labels.


  1. Select auto-encoder models that support classification tasks. So far, I have tried roberta-base and distilled-bert.

  2. Prepare model for transfer learning/fine-tuning. I followed the advice in this forum, by utilizing the parameter problem_type="multi_label_classification" for both tokenizer and model.

tokenizer = AutoTokenizer.from_pretrained(model_name,problem_type="multi_label_classification")
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=15,problem_type="multi_label_classification")

On model initialization, I can confirm that the classification head has changed to the num_labels that was specified. Below is the output from model initialization.

- classifier.out_proj.weight: found shape torch.Size([3, 1024]) in the checkpoint and torch.Size([15, 1024]) in the model instantiated
- classifier.out_proj.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([15]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

  3. Transfer-Learning/Fine-Tuning: I first tried transfer learning by freezing all the original model weights.

for name, param in model.named_parameters():
	if 'classifier' not in name: # classifier layer
		param.requires_grad = False

If I check which layers are being trained, it indeed does show only these layers.

(classifier): RobertaClassificationHead(
    (dense): Linear(in_features=1024, out_features=1024, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=1024, out_features=15, bias=True)
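
One way to double-check which parameters will actually receive gradients is to list those with `requires_grad` still set. The sketch below applies the same freezing rule to a tiny stand-in module; `TinyClassifier` and its layer sizes are hypothetical and only mimic the naming of the real model.

```python
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in: an 'encoder' plus a 'classifier' head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 15)

model = TinyClassifier()

# Same rule as in the post: freeze everything outside the classifier head.
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

# Collect the names of the parameters that remain trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

On the real model, the same list comprehension over `model.named_parameters()` should show only names that start with `classifier`.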

I want to make sure the model can learn the data I give it at all. I read that one of the first steps is to purposely overfit on the data and see whether the model can accurately predict on the training data; essentially, regurgitate the information it was given. If it can’t do that, there is no way it can generalize to examples it hasn’t trained on.

I trained for about 100 epochs and achieved a loss of 0.01.

I predicted on an example from the training data, converted the logits to probabilities using a sigmoid function (since multi-label classification uses BCEWithLogitsLoss), and found which labels had a probability > 0.5.
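
The prediction step described above can be sketched as follows. This is a minimal, standard-library-only illustration: the logits are made-up values rather than output from the actual model, and `predict_labels` is a hypothetical helper name.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Return (probabilities, multi-hot predictions) for one example."""
    probs = [sigmoid(z) for z in logits]
    preds = [1 if p > threshold else 0 for p in probs]
    return probs, preds

# Hypothetical logits for a 5-label problem (illustrative values only).
example_logits = [2.0, 0.5, -1.0, -3.0, 0.0]
probs, preds = predict_labels(example_logits)
print([round(p, 3) for p in probs])
print(preds)
```

When the logits come from calling a PyTorch model directly, the model would normally be put in evaluation mode with `model.eval()` first, so that dropout is disabled during prediction.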

Surprisingly, it was unable to even correctly predict the label(s) that the training examples were trained on!

Here is an example.

Input = "Encourage faculty, staff and students to confront racism and the intersectionality of race and sexism."
Label_String = ["DEI Initiatives","Mentorship"] 
Label = [1,1,0,0,0,0,0,0,0,0,0,0,0,0,0]
Output_Probability = [.3,.3,.2,.4,.3,.3,.3,.2,.1,.3,.2,.1,.3,.3,.3]

I would expect the output probability to be closer to the label.
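
As a sanity check on the numbers above, the binary cross-entropy implied by this single example can be computed by hand. The sketch assumes the mean reduction that `BCEWithLogitsLoss` uses by default and works from the rounded probabilities shown, so the result is approximate.

```python
import math

label = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [.3, .3, .2, .4, .3, .3, .3, .2, .1, .3, .2, .1, .3, .3, .3]

def binary_cross_entropy(targets, probabilities):
    """Mean binary cross-entropy over all labels of one example."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)

loss = binary_cross_entropy(label, probs)
print(f"{loss:.3f}")  # about 0.420
```

A per-example loss of roughly 0.42 is far above the reported training loss of 0.01, so these probabilities do not look like the ones the training loss was computed on.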

I tried again with different ratios of the original model's layers frozen: 100% (leaving only the new classification head trainable), 75%, 50%, and 0%. Only the 0% setting trained a model that could predict the training examples, but that defeats the purpose of transfer learning and fine-tuning, and won’t generalize well anyway.
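
To make the freeze ratio concrete, the selection rule used in the source code further down (freeze a parameter if its name does not contain 'classifier' and its index is below `round(total_params * freeze_layer_ratio)`) can be sketched on a plain list of names. The parameter names here are hypothetical placeholders, and `frozen_names` is an illustrative helper, not part of the training script.

```python
def frozen_names(param_names, freeze_layer_ratio):
    """Return the names that the ratio-based rule would freeze."""
    cutoff = round(len(param_names) * freeze_layer_ratio)
    return [
        name
        for i, name in enumerate(param_names)
        if "classifier" not in name and i < cutoff
    ]

# Hypothetical parameter names, in the order named_parameters() would yield them.
names = [
    "embeddings.weight",
    "layer.0.weight",
    "layer.0.bias",
    "layer.1.weight",
    "layer.1.bias",
    "classifier.dense.weight",
    "classifier.dense.bias",
    "classifier.out_proj.weight",
]

for ratio in (1.0, 0.5, 0.0):
    print(ratio, frozen_names(names, ratio))
```

Note that the ratio counts parameter tensors in iteration order, not layers or individual weights, so 50% here means the first half of the named tensors rather than half of the network's weights.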

I also tried loading the model from checkpoints, from the final saved model file, and using the model trained in memory. All give the same results: the model does poorly on the training dataset, even after achieving a low training loss.

I recognize that for transfer learning or fine-tuning to work, there are certain criteria for the original model and the new dataset, such as:

  1. The original model should be trained on a task similar to the transfer learning task.


Questions:

  1. How is the model able to reach a low training loss if its output probabilities are so different from the labels of the training examples?
  2. Are the model weights (or other key model parameters) somehow different between training and prediction, such that the model that trains on the data differs from the model that predicts on it?


  1. I’ve used both the CPU and the MPS device, since I’m training on an M1 machine.

Source Code

import argparse
import pandas as pd
import re
import numpy as np
import os
import json
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    IntervalStrategy,
    Trainer,
    TrainingArguments,
)
import logging
import torch
from datasets import load_dataset
import evaluate
from dotenv import load_dotenv

def train_model(document_type:str,checkpoint:bool,use_mps_device:bool,model_name:str,freeze_layer_ratio:float):
    try:
        dataset = load_dataset("parquet",data_files=f"data/{document_type}.parquet",split="train")
    except Exception as e:
        logging.error(f"data/{document_type}.parquet does not exist")
        raise Exception(f"data/{document_type}.parquet does not exist") from e
    dataset = dataset.train_test_split(0.1)

    tokenizer = AutoTokenizer.from_pretrained(model_name,problem_type="multi_label_classification")
    mps_device = torch.device("mps")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True,return_tensors="pt").to(mps_device)

    tokenized_datasets = dataset.map(tokenize_function, batched=True,remove_columns=["text"])

    model = AutoModelForSequenceClassification.from_pretrained(model_name,ignore_mismatched_sizes=True, num_labels=len(dataset["train"]["labels"][0]),problem_type="multi_label_classification")
    total_params = len(list(model.named_parameters()))
    for i, (name, param) in enumerate(model.named_parameters()):
        if 'classifier' not in name and i < round(total_params*freeze_layer_ratio): # classifier layer
            param.requires_grad = False

    training_args = TrainingArguments(
        evaluation_strategy = IntervalStrategy.EPOCH,
        optim="adafactor", # also tried default adam
        save_total_limit = 3, # Only last 3 models are saved. Older ones are deleted.
        metric_for_best_model = 'accuracy',

    metric = evaluate.combine(["accuracy", "f1", "precision", "recall"])

    def sig(x):
        return 1/(1 + np.exp(-x))

    def compute_metrics(eval_pred):
        threshold = 0.5
        logits, labels = eval_pred
        for logit,label in zip(logits,labels):
            predictions = []
            references = []
            for i,l in enumerate(label):
                # if l: # Only calculate accuracy based on the True labels
        return metric.compute()

    trainer = Trainer(
        callbacks = [EarlyStoppingCallback(early_stopping_patience=6)],


parser = argparse.ArgumentParser(description='Model Training Options')
parser.add_argument('-s','--save_data',action='store_true', help='Save the data from the excel file "Document AI Score"')
parser.add_argument('-n','--model_name',default='roberta-large-mnli', help='Name of the model. Default roberta-large-mnli')
parser.add_argument('-d','--document_type',default="all", help='Select the document type. If all document types, select "all"')
parser.add_argument('-c','--checkpoint', action='store_true', help='Select checkpoint to start training from again.')
parser.add_argument('-m','--use_mps_device',action='store_true', help='Use the MPS device when training on an M1 MacOS device.')
parser.add_argument('-r','--freeze_layer_ratio',type=float,default=0.5, help='Fraction of the base model parameters to freeze (0.0 to 1.0). Default 0.5.')

def main():
    args = parser.parse_args()

    if args.save_data:
    if args.document_type == "all":
        with open("document_features.json", 'r') as openfile:
            document_features:dict = json.load(openfile)
        for document_type_ in list(document_features.keys()):

if __name__ == "__main__":