Getting the error 'The model did not return a loss from the inputs, only the following keys: logits.' only when predict_with_generate = True

I am trying to apply instruction tuning to a Llama-based model. I have two problems with it:

  1. When I set predict_with_generate = True, training raises an error before it starts:
Traceback (most recent call last):
  File "/dss/dsshome1/02/ra95kix2/seminar_fma/growth-vs-forgetting/src/utils/finetune.py", line 730, in <module>
  File "/dss/dsshome1/02/ra95kix2/seminar_fma/growth-vs-forgetting/src/utils/finetune.py", line 690, in train
    train_result = trainer.train()
  File "/dss/dsshome1/02/ra95kix2/miniconda3/envs/clearning/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/dss/dsshome1/02/ra95kix2/miniconda3/envs/clearning/lib/python3.11/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/dss/dsshome1/02/ra95kix2/miniconda3/envs/clearning/lib/python3.11/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/dss/dsshome1/02/ra95kix2/miniconda3/envs/clearning/lib/python3.11/site-packages/transformers/trainer.py", line 2692, in compute_loss
    raise ValueError(
ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,attention_mask,labels.
  2. When I just want to make predictions (do_predict), I get:
Traceback (most recent call last):
  File "/dss/dsshome1/02/ra95kix2/seminar_fma/growth-vs-forgetting/src/utils/finetune_v2.py", line 716, in <module>
  File "/dss/dsshome1/02/ra95kix2/seminar_fma/growth-vs-forgetting/src/utils/finetune_v2.py", line 691, in train
    prediction_output = trainer.predict(
  File "/dss/dsshome1/02/ra95kix2/miniconda3/envs/clearning/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 216, in predict
    return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "/dss/dsshome1/02/ra95kix2/miniconda3/envs/clearning/lib/python3.11/site-packages/transformers/trainer.py", line 3010, in predict
    output = eval_loop(
  File "/dss/dsshome1/02/ra95kix2/miniconda3/envs/clearning/lib/python3.11/site-packages/transformers/trainer.py", line 3123, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/dss/dsshome1/02/ra95kix2/miniconda3/envs/clearning/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 305, in prediction_step
    loss = (outputs["loss"] if isinstance(outputs, dict) else outputs[0]).mean().detach()
  File "/dss/dsshome1/02/ra95kix2/miniconda3/envs/clearning/lib/python3.11/site-packages/transformers/utils/generic.py", line 318, in __getitem__
    return inner_dict[k]
KeyError: 'loss'

I don’t know why it attempts to compute a loss even though I set predict_with_generate. During training this is not required, since I am already tuning with a dataset, but during the evaluation steps (I added one evaluation after each epoch) it would be nice to see BLEU or SARI scores. For prediction (I assume I can use my test data), I just want to assess inference performance, so I only need the model's generated predictions.

Here is the related part of my script:

model, tokenizer = get_accelerate_model(args, checkpoint_dir)
    model.config.use_cache = False
    print("Loaded model")
    data_module = make_data_module(tokenizer=tokenizer, args=args)
    task = args.task

    trainer = Seq2SeqTrainer(
        **{k: v for k, v in data_module.items() if k != "predict_dataset"},
        compute_metrics=lambda p: compute_metrics(
            eval_labels = p.label_ids,
            task=task  # Dynamically fetched from the config

    # Verifying the datatypes and parameter counts before training.
    print_trainable_parameters(args, model)
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total += v
    for k, v in dtypes.items():
        print(k, v, v / total)

    all_metrics = {"run_name": args.run_name}
    # Training
    if args.do_train:
        logger.info("*** Train ***")
        print('are we in train?')
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
    # Evaluation
    if args.do_eval:
        logger.info("*** Evaluate ***")
        print('are we in evaluate?')
        metrics = trainer.evaluate(metric_key_prefix="eval")
        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)
    # Prediction
    if args.do_predict:
        logger.info("*** Predict ***")
        if 'labels' not in data_module['predict_dataset'].column_names:
            logger.warning("No 'labels' column found in prediction dataset. Metrics like BLEU may not work.")
        prediction_output = trainer.predict(
        prediction_metrics = prediction_output.metrics
        predictions = prediction_output.predictions
        predictions = np.argmax(predictions, axis=-1)
        predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
        predictions = tokenizer.batch_decode(
            predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True
        with open(os.path.join(args.output_dir, 'predictions.jsonl'), 'w') as fout:
            for i, example in enumerate(data_module['predict_dataset']):
                example['prediction_with_input'] = predictions[i].strip()
                example['prediction'] = predictions[i].replace(example['input'], '').strip()
                fout.write(json.dumps(example) + '\n')
        trainer.log_metrics("predict", prediction_metrics)
        trainer.save_metrics("predict", prediction_metrics)

    if (args.do_train or args.do_eval or args.do_predict):
        with open(os.path.join(args.output_dir, "metrics.json"), "w") as fout:
I think the error indicates that the model doesn’t return a loss when one is expected. This commonly happens when the model’s configuration or the dataset is not set up properly. You could try the following:

  • Set the labels in your dataset: The training process expects the labels field in order to compute the loss. Ensure that both your training and evaluation datasets include it. For example:

dataset = dataset.map(lambda x: {'labels': x['target_column']})
  • Verify the model configuration: Ensure the model’s config aligns with the task. For example (note that decoder_start_token_id only applies to encoder-decoder models):

model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
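  • Override compute_loss: If your model only returns logits, you can compute the loss manually. compute_loss is a Trainer method rather than a model method, so subclass the trainer and override it there:

import torch.nn.functional as F
from transformers import Seq2SeqTrainer

class CustomSeq2SeqTrainer(Seq2SeqTrainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Causal LM: the logits at position t predict the token at position t + 1.
        shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
        shift_labels = labels[:, 1:].reshape(-1)
        # -100 is the default ignore index, so masked and padded positions are skipped.
        loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
        return (loss, outputs) if return_outputs else loss

Then use this custom trainer in place of Seq2SeqTrainer in your script.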


Hey Alan, thanks for your support!

The thing is, I am using code from someone else’s repo. This code contains a custom data collator:

class DataCollatorForCausalLM(object):
    tokenizer: transformers.PreTrainedTokenizer
    source_max_len: int
    target_max_len: int
    train_on_source: bool
    predict_with_generate: bool

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        # Debugging: Ensure keys are present
        assert all('input' in instance and 'labels' in instance for instance in instances), \
            "Each instance must have 'input' and 'labels' keys."

        # Extract elements
        sources = [f"{self.tokenizer.bos_token}{example['input']}" for example in instances]
        targets = [f"{example['labels']}{self.tokenizer.eos_token}" for example in instances]

        # Tokenize
        tokenized_sources = self.tokenizer(
        tokenized_targets = self.tokenizer(

        # Build input_ids and labels
        input_ids = []
        labels = []
        for tokenized_source, tokenized_target in zip(
            tokenized_sources['input_ids'], tokenized_targets['input_ids']
            combined_input = tokenized_source + tokenized_target
            if not self.predict_with_generate:
                if not self.train_on_source:
                        torch.tensor([IGNORE_INDEX] * len(tokenized_source) + tokenized_target)

        # Apply padding
        input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
        labels = pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX) if not self.predict_with_generate else None

        return {
            'input_ids': input_ids,
            'attention_mask': input_ids.ne(self.tokenizer.pad_token_id),
            'labels': labels,

It leaves labels empty (None, as the last lines of the collator show) if predict_with_generate is set to True. I thought this was expected, since we let the model generate sentences and therefore don’t need labels anymore.
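
A rough illustration of why missing labels lead to this error: Hugging Face causal-LM heads only include a loss in their output when labels are passed, and the trainer's training step requires that loss. The sketch below mimics this with plain-Python stand-ins. toy_causal_lm and toy_compute_loss are hypothetical names for illustration, not transformers APIs, and the constant loss value is a placeholder.

```python
from typing import Dict, List, Optional


def toy_causal_lm(input_ids: List[int], labels: Optional[List[int]] = None) -> Dict[str, object]:
    """Hypothetical stand-in for a causal LM forward pass."""
    outputs: Dict[str, object] = {"logits": [0.0] * len(input_ids)}
    if labels is not None:
        # A real model would compute cross-entropy here; a constant is enough for the sketch.
        outputs["loss"] = 1.0
    return outputs


def toy_compute_loss(inputs: Dict[str, object]) -> float:
    """Hypothetical stand-in for the loss check in the trainer's training step."""
    outputs = toy_causal_lm(**inputs)
    if "loss" not in outputs:
        raise ValueError(
            "The model did not return a loss from the inputs, only the following keys: "
            + ",".join(outputs.keys())
        )
    return outputs["loss"]
```

With labels set to None, as the collator does when predict_with_generate is True, toy_compute_loss raises the same kind of ValueError as in the first traceback, even though the labels key itself is present in the batch.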

When I removed the “if not self.predict_with_generate” conditions (so that the data collator always adds labels), inference started without any issue. I am wondering whether I need to keep these changes in the data collator…
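
One way to keep labels in every batch, sketched under assumptions: build the labels from the source and target token ids regardless of predict_with_generate, and mask the source positions with IGNORE_INDEX when train_on_source is False. build_labels is a hypothetical helper, not part of the repo's collator, and IGNORE_INDEX = -100 is assumed to match the value used there (it is the default ignore index of PyTorch's cross-entropy loss).

```python
from typing import List

IGNORE_INDEX = -100  # assumed value; matches PyTorch's default ignore_index


def build_labels(source_ids: List[int], target_ids: List[int], train_on_source: bool) -> List[int]:
    """Return per-token labels for one source + target example (hypothetical helper)."""
    if train_on_source:
        # Compute the loss on both the prompt and the answer.
        return list(source_ids) + list(target_ids)
    # Compute the loss on the answer only; prompt positions are ignored.
    return [IGNORE_INDEX] * len(source_ids) + list(target_ids)


labels = build_labels([1, 15, 16], [20, 21, 2], train_on_source=False)
print(labels)  # [-100, -100, -100, 20, 21, 2]
```

If the removed condition also controlled whether the target tokens are appended to input_ids, it may be worth checking that generation is still prompted with the source only, since otherwise the model would see the reference answer in its prompt.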

