How to access the input_ids in EvalPrediction.predictions in Seq2SeqTrainer?

alvations · November 2, 2022, 1:38am

I'm training a model with Seq2SeqTrainer and computing metrics with evaluate. The setup looks something like this:

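import evaluate
from transformers import (
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Tokenizer used for decoding in compute_metrics
# (assumed to come from the same checkpoint as the model below).
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

mt_metrics = evaluate.combine(
    ["bleu", "chrf"]
)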

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    references = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    outputs = mt_metrics.compute(predictions=predictions,
                             references=references)

    return outputs

model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=100,
    save_steps=500,
    eval_steps=1,
    max_steps=1_000_000,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to=None,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=valid_data.with_format("torch"),
    eval_dataset=test_data.with_format("torch"),
    compute_metrics=compute_metrics,
)

trainer.train()

The EvalPrediction object passed to compute_metrics contains the label_ids and the prediction ids, but it doesn't contain the input_ids. Some metrics (COMET, for example) also need the source inputs, so I'd like to be able to write something like this:

mt_metrics = evaluate.combine(
    ["bleu", "chrf", "comet"]
)

def compute_metrics(pred, input_ids):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    
    sources = tokenizer.batch_decode(input_ids, skip_special_tokens=True)
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    references = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    outputs = mt_metrics.compute(predictions=predictions,
                             references=references, sources=sources)

    return outputs

Is there some way to include the input_ids in the EvalPrediction object when using Seq2SeqTrainer?

If there isn’t, could anyone point me to the docs I'd need to write my own custom Seq2SeqTrainer and EvalPrediction? Thank you in advance!

sgugger · November 2, 2022, 2:02pm

You have a training argument for that: include_inputs_for_metrics.
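For readers landing here later, a minimal sketch of how this might be wired up. It assumes that setting include_inputs_for_metrics=True in Seq2SeqTrainingArguments makes the source ids available as pred.inputs, and that all three id arrays may contain -100 padding that has to be replaced before decoding. The helper names (unmask, make_compute_metrics) and the toy tokenizer and metric are hypothetical stand-ins so the data flow can be checked without downloading a model; in a real run you would pass the actual tokenizer and mt_metrics instead.

```python
from types import SimpleNamespace

# In the real setup (assumption): Seq2SeqTrainingArguments(..., include_inputs_for_metrics=True)
IGNORE_INDEX = -100  # value assumed to mark ignored/padded positions


def unmask(ids, pad_token_id):
    """Replace IGNORE_INDEX with pad_token_id so the ids can be decoded."""
    return [[pad_token_id if int(t) == IGNORE_INDEX else int(t) for t in row] for row in ids]


def make_compute_metrics(tokenizer, metric):
    """Build a compute_metrics callback that also decodes the source inputs."""
    def compute_metrics(pred):
        pad = tokenizer.pad_token_id

        def decode(ids):
            return tokenizer.batch_decode(unmask(ids, pad), skip_special_tokens=True)

        return metric.compute(
            sources=decode(pred.inputs),          # only present when inputs are included
            predictions=decode(pred.predictions),
            references=decode(pred.label_ids),
        )
    return compute_metrics


# --- Dry run with hypothetical stand-ins, only to show the data flow ---
class ToyTokenizer:
    pad_token_id = 0

    def batch_decode(self, batch, skip_special_tokens=True):
        return [" ".join(str(t) for t in row if t != self.pad_token_id) for row in batch]


class ToyMetric:
    def compute(self, **kwargs):
        return kwargs  # echo the decoded strings instead of scoring them


pred = SimpleNamespace(inputs=[[5, 6, -100]], predictions=[[7, 8, 0]], label_ids=[[7, 9, -100]])
result = make_compute_metrics(ToyTokenizer(), ToyMetric())(pred)
```

With the real objects, the callback would be passed as compute_metrics=make_compute_metrics(tokenizer, mt_metrics). More recent transformers releases appear to replace this flag with include_for_metrics=["inputs"], so it is worth checking the docs for your installed version.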

alvations · November 2, 2022, 3:15pm

Thanks for the prompt reply! The include_inputs_for_metrics is exactly what’s needed =)

inoormoq · November 20, 2022, 11:19am

I’m confused. What is the difference between the three: input, prediction, and reference? And which evaluation metrics need all of them?

alvations · November 22, 2022, 1:23am

@inoormoq, here are a few examples of metrics in the evaluate package that use references. I hope the examples are self-explanatory: "Huggingface Evaluate for MT Evaluations" (Kaggle)
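As a rough illustration of the three roles in a translation setting, here is a toy sketch. The sentences and both helper functions are made up for this example and are not part of the evaluate package. The first check only needs the prediction and the reference, which is the situation for BLEU and chrF in the snippets above; the second needs the source as well, which is why a metric like COMET is called with sources.

```python
# One English -> German example, split into its three roles.
example = {
    "source": "The cat sleeps.",         # input: the text fed to the model (input_ids)
    "prediction": "Die Katze schläft.",  # what the model generated (pred.predictions)
    "reference": "Die Katze schläft.",   # the human translation (pred.label_ids)
}


def exact_match(prediction, reference):
    """Reference-based toy metric: compares the output with the human translation."""
    return float(prediction.strip() == reference.strip())


def is_untranslated_copy(source, prediction):
    """Source-aware toy check: flags outputs that just copy the input."""
    return prediction.strip().lower() == source.strip().lower()


score = exact_match(example["prediction"], example["reference"])
copied = is_untranslated_copy(example["source"], example["prediction"])
```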

inoormoq · November 25, 2022, 7:27pm

That’s very informative in many respects. Thanks a lot.
