It takes every column in the batch except the ones listed in label_names (defaults to ["labels"]) and pads / stacks them into tensors. Those become the inputs.
When Trainer calls model(**batch), each key that matches a parameter in the model's forward signature is used. A key called labels (or whatever you listed in label_names) is treated as the targets and the model will compute a loss from it.
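As a rough sketch (not the actual Trainer internals, and the feature values are made up), the flow looks something like this:

    from transformers import default_data_collator

    # Two made-up, already-tokenized examples; "labels" is the target column.
    features = [
        {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1], "labels": 0},
        {"input_ids": [101, 2009, 102], "attention_mask": [1, 1, 1], "labels": 1},
    ]

    batch = default_data_collator(features)
    # batch == {"input_ids": 2x3 tensor, "attention_mask": 2x3 tensor, "labels": tensor([0, 1])}

    # Inside the training loop the Trainer then roughly does:
    #   outputs = model(**batch)   # every key must match a forward() argument
    #   loss = outputs.loss        # computed by the model because "labels" was passed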
Thank you, that is much clearer than the docs. And exactly not the behavior I expected.
The dataset I am using also has an id field. Should this be excluded somehow? It's just a leftover representation of someone's SQL database and not relevant to the LLM.
Or is that what the tokenize function does, only including the field or fields I want? This is my version of the example code:
Again, this is just my guess (in case I'm wrong), but nothing happens to extra columns unless you tell Hugging Face to do something with them.
When you call dataset.map(tokenize, batched=True), the mapping function only adds the keys it returns (e.g. input_ids, attention_mask); it does not delete anything that was already there.
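For example, with a tiny made-up dataset shaped like yours (id, synopsis, and label are just stand-in column names):

    from datasets import Dataset
    from transformers import AutoTokenizer

    # Mimics the situation: an "id" column left over from the SQL export.
    raw_ds = Dataset.from_dict({
        "id": [1, 2],
        "synopsis": ["a story about a dog", "a story about a cat"],
        "label": [0, 1],
    })

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def tokenize(batch):
        return tokenizer(batch["synopsis"], truncation=True)

    print(raw_ds.column_names)
    # ['id', 'synopsis', 'label']

    tokenized = raw_ds.map(tokenize, batched=True)
    print(tokenized.column_names)
    # ['id', 'synopsis', 'label', 'input_ids', 'token_type_ids', 'attention_mask']
    #  ^ the original columns are still there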
At training time the default data collator tries to batch every remaining column.
Maybe try something like this:

    tokenized = raw_ds.map(
        tokenize,
        batched=True,
        remove_columns=[c for c in raw_ds.column_names if c not in ("label",)],
    )
That leaves you with only the inputs (and the labels), so id, synopsis, etc. never reach the model.
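You can sanity-check what survives (token_type_ids only shows up for BERT-style tokenizers):

    print(tokenized.column_names)
    # ['label', 'input_ids', 'token_type_ids', 'attention_mask']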
The tokenizer inserts the model's [SEP] token between them, sets token_type_ids correctly, and truncates/pads both sides together.
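Concretely, something like this (text_a / text_b and the checkpoint are placeholders; swap in your own two columns):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any BERT-style checkpoint

    def tokenize(batch):
        # Passing two text arguments builds a single paired input:
        # [CLS] text_a [SEP] text_b [SEP], with token_type_ids 0 for the first
        # segment and 1 for the second; truncation applies to the pair as a whole.
        return tokenizer(
            batch["text_a"],        # placeholder column names
            batch["text_b"],
            truncation=True,
            padding="max_length",
            max_length=128,
        )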