How Does Trainer Know Which Training Inputs and Labels to Use?

Hi, I have a dataset with the following features:

{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
'tags': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

I can successfully give this dataset to Trainer. However, which columns does it use as inputs, and which does it use as labels?


It is a shame no one replied; I am searching for the same information myself.


From my understanding…

By default, the Trainer drops every dataset column whose name doesn't match a parameter of the model's forward method (that's the remove_unused_columns=True behaviour), and the data collator pads/stacks whatever is left into tensors. Those become the inputs.

When the Trainer calls model(**batch), each key is passed to the matching parameter of forward. A key called labels (or whatever you listed in label_names, which defaults to ["labels"]) is treated as the targets, and the model computes a loss from it.

You can dig into Transformers' trainer.py for the details.
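
Here is a minimal sketch of that flow (not the actual Trainer code; the checkpoint and the batch contents are just made up for illustration):

import inspect

import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

# Columns whose names don't appear in forward's signature get dropped
# (this is the remove_unused_columns=True behaviour).
forward_params = set(inspect.signature(model.forward).parameters)

batch = {
    "input_ids": torch.randint(0, 1000, (2, 16)),
    "attention_mask": torch.ones(2, 16, dtype=torch.long),
    "labels": torch.zeros(2, 16, dtype=torch.long),
    "tags": [[1, 2], [3, 4]],  # extra column: not a forward argument, so it is removed
}
batch = {k: v for k, v in batch.items() if k in forward_params}

# Because "labels" is present, the model computes the loss itself;
# the Trainer just reads outputs.loss and backpropagates it.
outputs = model(**batch)
print(outputs.loss)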

Leave a like if this helped at all 🙂


Thank you, that is much clearer than the docs. And it is not at all the behavior I expected.

The dataset I am using also has an id field. Should this be excluded somehow? It's just a leftover from someone's SQL database and not relevant to the LLM.

Or is that what the tokenize function does, only including the field or fields I want? This is my version of the example code:

def tokenize(examples):
    return tokenizer(examples["synopsis"], padding="max_length", truncation=True)

If I also want to include a 'name' field, do I just string-append it to synopsis?


Again, this is just my guess (in case I'm wrong), but nothing happens to extra columns unless you tell Hugging Face to do something with them.

When you call dataset.map(tokenize, batched=True), the mapping function only adds the keys it returns (e.g. input_ids, attention_mask) – it does not delete anything that was already there.

At training time the default data collator tries to batch every remaining column.

Maybe try something like this…
tokenized = raw_ds.map(
    tokenize,
    batched=True,
    remove_columns=[c for c in raw_ds.column_names if c not in ("label",)],
)

That leaves you with only the inputs (and the labels), so id, synopsis, etc. never reach the model.
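
If you want to double-check, print the columns after mapping (the names here are just an example):

print(tokenized.column_names)
# e.g. ['label', 'input_ids', 'token_type_ids', 'attention_mask']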

For the name field, rather than string-appending it yourself, you can pass the two fields to the tokenizer as a sentence pair. The tokenizer then inserts the model's [SEP] token between them, sets token_type_ids correctly, and truncates/pads both sides together.

So your function could be something like…

def tokenize(examples):
    return tokenizer(examples["synopsis"], examples["name"], padding="max_length", truncation=True)
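
A quick way to sanity-check the pair behaviour (the strings are made up, and this assumes a BERT-style tokenizer that produces token_type_ids):

enc = tokenizer("A space western about bounty hunters.", "Cowboy Bebop", truncation=True)
print(tokenizer.decode(enc["input_ids"]))
# e.g. "[CLS] a space western about bounty hunters. [SEP] cowboy bebop [SEP]" with an uncased tokenizer
print(enc["token_type_ids"])  # 0s for the synopsis segment, 1s for the name segment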
