Metadata in batches

I am trying to train a very simple seq2seq model. The training data consists of the usual input_ids, labels and attention_mask columns, plus an additional column (example_id) that lets me map each batch entry back to the original dataset.

In order to associate the evaluation metrics with the original raw dataset, I have added an “example_id” column that allows me to study the behaviour of the model on specific classes of input data points. I keep this column in the batch using this custom collator:

class MyCollatorForSeq2Seq(DataCollatorForSeq2Seq):
    def __call__(self, features, return_tensors=None):
        # Pull the metadata out before the standard collator sees it,
        # since it would otherwise try to pad/tensorise it.
        example_ids = [feature["example_id"] for feature in features]
        features = [
            {k: v for k, v in feature.items() if k != "example_id"}
            for feature in features
        ]

        # Run the standard seq2seq collation on the remaining fields.
        batch = super().__call__(features, return_tensors="pt")

        # Reattach the metadata to the finished batch.
        assert isinstance(example_ids[0], int), "problem with example ids in batching"
        batch["example_id"] = example_ids
        return batch
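The pop-and-reattach pattern itself works as expected outside the Trainer. Here is a minimal stand-alone sanity check (base_collate is a hypothetical stand-in for DataCollatorForSeq2Seq, which needs a real tokenizer; the field names match my setup):

```python
def base_collate(features):
    # Stand-in for DataCollatorForSeq2Seq.__call__: just gathers
    # each field into a list (no padding or tensor conversion).
    return {k: [f[k] for f in features] for k in features[0]}

def collate_with_metadata(features):
    # Pull the metadata out before the base collator sees it...
    example_ids = [f["example_id"] for f in features]
    features = [{k: v for k, v in f.items() if k != "example_id"} for f in features]
    batch = base_collate(features)
    # ...then reattach it to the finished batch.
    batch["example_id"] = example_ids
    return batch

features = [
    {"input_ids": [1, 2], "labels": [3], "example_id": 0},
    {"input_ids": [4], "labels": [5], "example_id": 1},
]
batch = collate_with_metadata(features)
assert batch["example_id"] == [0, 1]
```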

When I pass a batch built as described above to a standard Seq2SeqTrainer, I get an error about an unexpected key in the batch:

T5ForConditionalGeneration.forward() got an unexpected keyword argument 'example_id'

Is there a simple way to make the Trainer accept metadata fields in the batch, or is subclassing the Trainer the only way?
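The workaround I am considering, in case it helps frame the question: subclass Seq2SeqTrainer and pop "example_id" from the inputs in compute_loss (and prediction_step for evaluation) before they reach model(**inputs). The core of that, filtering the batch down to what the model's forward signature accepts, can be sketched without transformers at all; strip_unexpected_kwargs and fake_forward below are illustrative names, not library API:

```python
import inspect

def strip_unexpected_kwargs(fn, batch):
    # Keep only the batch keys that fn's signature actually accepts;
    # this is what a compute_loss override would do before fn(**batch).
    params = inspect.signature(fn).parameters
    return {k: v for k, v in batch.items() if k in params}

# Stand-in for T5ForConditionalGeneration.forward
def fake_forward(input_ids, attention_mask, labels):
    return {"loss": 0.0}

batch = {
    "input_ids": [[1, 2]],
    "attention_mask": [[1, 1]],
    "labels": [[3]],
    "example_id": [42],  # metadata that forward() would reject
}

model_inputs = strip_unexpected_kwargs(fake_forward, batch)
assert "example_id" not in model_inputs
fake_forward(**model_inputs)  # no TypeError now
```

But if the Trainer already supports this without a subclass, I would prefer that.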
