Model validation failed - Target is multiclass but average='binary'

As a primer, I’m trying to see whether an LLM, in this case Mistral-7B-v0.1, gives me any advantage in my text classification work. I used the following blog post as my base methodology, with a few adjustments → blog/ at main · huggingface/blog · GitHub. The primary difference is that the author has binary classes, whereas my dataset has 30 classes.

That said, I’m running into an issue where results reporting fails once validation completes, and I receive the error below. I did see a handful of other people reporting the same issue, but no solutions. I’m stumped at the moment, so I’m seeking guidance on a resolution. Any clues as to the cause?

ValueError                                Traceback (most recent call last)
Cell In[17], line 1
----> 1 mistral_trainer.train()

File ~/anaconda3/envs/mlenv/lib/python3.10/site-packages/transformers/, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1553         hf_hub_utils.enable_progress_bars()
   1554 else:
-> 1555     return inner_training_loop(
   1556         args=args,
   1557         resume_from_checkpoint=resume_from_checkpoint,
   1558         trial=trial,
   1559         ignore_keys_for_eval=ignore_keys_for_eval,
   1560     )

File ~/anaconda3/envs/mlenv/lib/python3.10/site-packages/transformers/, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1934     self.control.should_training_stop = True
   1936 self.control = self.callback_handler.on_epoch_end(args, self.state, self.control)
-> 1937 self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
   1939 if DebugOption.TPU_METRICS_DEBUG in self.args.debug:
   1940     if is_torch_tpu_available():
   1941         # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)

File ~/anaconda3/envs/mlenv/lib/python3.10/site-packages/transformers/, in Trainer._maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
   2269         metrics.update(dataset_metrics)
   1526         UserWarning,
   1527     )

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

Previously, I did not set the average argument in the compute_metrics function, just as in the blog post, so I assumed that was causing the error (Target is multiclass but average='binary'). However, when I set average='weighted' in compute_metrics, I still get the same error.
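
For reference, the error text comes from scikit-learn, which the evaluate precision, recall, and F1 metrics call internally. Below is a minimal plain-Python sketch of what average='weighted' computes for F1: the per-class F1 scores averaged with each class weighted by its support. The function name weighted_f1 and the toy labels are hypothetical and only illustrate the idea; this is not the library's implementation.

```python
from collections import Counter


def weighted_f1(references, predictions):
    """Support-weighted mean of per-class F1 (the idea behind average='weighted')."""
    support = Counter(references)  # number of true examples per class
    total = len(references)
    score = 0.0
    for cls, n_true in support.items():
        # True positives: examples of this class that were predicted as this class.
        tp = sum(1 for r, p in zip(references, predictions) if r == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to how often it occurs in the references.
        score += f1 * n_true / total
    return score


# Toy multiclass example (three classes), made up for illustration.
labels = [0, 0, 1, 2]
preds = [0, 1, 1, 2]
result = weighted_f1(labels, preds)
print(f"weighted F1: {result:.4f}")
```

With more than two classes there is no single "positive" class, which is why the default average='binary' is rejected for multiclass targets.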

Here is my code:

import torch
import evaluate
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType

label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['label'])

num_classes = len(set(df['label']))

train_val_df, test_df = train_test_split(df, test_size=0.1, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.2, random_state=42)

data = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'val': Dataset.from_pandas(val_df),
    'test': Dataset.from_pandas(test_df)
})

for split in data.keys():
    data[split] = data[split].remove_columns('__index_level_0__')

# Extract class labels
class_labels = train_df['label']

# Calculate class weights
class_weights = compute_class_weight('balanced', classes=pd.unique(class_labels), y=class_labels)

# Create a dictionary with class labels and their corresponding weights
class_weight_dict = dict(zip(pd.unique(class_labels), class_weights))
sorted_keys = sorted(class_weight_dict.keys())
class_weight_dict = {key: class_weight_dict[key] for key in sorted_keys}

mistral_checkpoint = "mistralai/Mistral-7B-v0.1"
MAX_LEN = 512
col_to_delete = ['text']

# Load Mistral 7B Tokenizer
mistral_tokenizer = AutoTokenizer.from_pretrained(mistral_checkpoint, add_prefix_space=True)
mistral_tokenizer.pad_token_id = mistral_tokenizer.eos_token_id
mistral_tokenizer.pad_token = mistral_tokenizer.eos_token

def mistral_preprocessing_function(examples):
    return mistral_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

mistral_tokenized_datasets = data.map(mistral_preprocessing_function, batched=True, remove_columns=col_to_delete)
#mistral_tokenized_datasets = mistral_tokenized_datasets.rename_column("target", "label")

# Data collator for padding a batch of examples to the maximum length seen in the batch
mistral_data_collator = DataCollatorWithPadding(tokenizer=mistral_tokenizer)

mistral_model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=mistral_checkpoint,
    device_map={"": 0},
)

mistral_model.config.pad_token_id = mistral_model.config.eos_token_id

mistral_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
)

mistral_model = get_peft_model(mistral_model, mistral_peft_config)

def compute_metrics(eval_pred):

    # All metrics are already predefined in the HF `evaluate` package
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric= evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    logits, labels = eval_pred # eval_pred is the tuple of predictions and labels returned by the model
    predictions = np.argmax(logits, axis=-1)
    print(f"predictions: {predictions}; labels: {labels}")
    precision = precision_metric.compute(predictions=predictions, references=labels, average='weighted')["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels, average='weighted')["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer is expecting a dictionary where the keys are the metrics names and the values are the scores. 
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}


class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # Get model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Compute custom loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor(list(class_weight_dict.values()), device=model.device, dtype=logits.dtype))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

mistral_model = mistral_model.cuda()

lr = 1e-4
batch_size = 8
num_epochs = 1

training_args = TrainingArguments(
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,

mistral_trainer = WeightedCELossTrainer(


Python and library versions used:

python: 3.10.13
torch: 2.1.2
evaluate: 0.4.0
pandas: 2.1.1
numpy: 1.26.2
datasets: 2.12.0
sklearn: 1.3.0
transformers: 4.35.2
peft: 0.7.1
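
A list like the one above can be collected from the active environment with the standard library. This is a small sketch; the helper name installed_versions is hypothetical. Note that sklearn is distributed under the name scikit-learn.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


packages = [
    "torch", "evaluate", "pandas", "numpy",
    "datasets", "scikit-learn", "transformers", "peft",
]
report = installed_versions(packages)
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```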

I figured it out. I updated the evaluate package from 0.4.0 to 0.4.1 and that resolved the issue…
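
For anyone hitting the same error, a quick way to check whether the installed evaluate is at least 0.4.1 before training might look like the sketch below. The helpers version_tuple and meets_minimum are hypothetical names, and the comparison only handles plain dotted numeric versions.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '0.4.1' into (0, 4, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = version("evaluate")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("evaluate is not installed")
elif meets_minimum(installed, "0.4.1"):
    print(f"evaluate {installed} is 0.4.1 or newer")
else:
    print(f"evaluate {installed} is older than 0.4.1; try: pip install --upgrade evaluate")
```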
