It takes every column in the batch except the ones listed in label_names (defaults to ["labels"]) and pads / stacks them into tensors. Those become the inputs.
When Trainer calls model(**batch), each key that matches a parameter in the model's forward signature is used. A key called labels (or whatever you listed in label_names) is treated as the targets and the model will compute a loss from it.
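As a rough sketch (not the actual Trainer internals, and the feature values are made up), the flow looks something like this:

    from transformers import default_data_collator

    # Two made-up, already-tokenized examples; "labels" is the target column.
    features = [
        {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1], "labels": 0},
        {"input_ids": [101, 2009, 102], "attention_mask": [1, 1, 1], "labels": 1},
    ]

    batch = default_data_collator(features)
    # batch == {"input_ids": 2x3 tensor, "attention_mask": 2x3 tensor, "labels": tensor([0, 1])}

    # Inside the training loop the Trainer then roughly does:
    #   outputs = model(**batch)   # every key must match a forward() argument
    #   loss = outputs.loss        # computed by the model because "labels" was passed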
Thank you, that is much clearer than the docs. And exactly not the behavior I expected.
The dataset I am using also has an id field. Should this be excluded somehow? It's just a leftover representation of someone's SQL database and not relevant to the LLM.
Or is that what the tokenize function does, only including the field or fields I want? This is my version of the example code:
Again, this is just my guess (in case I'm wrong), but nothing happens to extra columns unless you tell Hugging Face to do something with them.
When you call dataset.map(tokenize, batched=True), the mapping function only adds the keys it returns (e.g. input_ids, attention_mask); it does not delete anything that was already there.
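For example, with a tiny made-up dataset shaped like yours (id, synopsis, and label are just stand-in column names):

    from datasets import Dataset
    from transformers import AutoTokenizer

    # Mimics the situation: an "id" column left over from the SQL export.
    raw_ds = Dataset.from_dict({
        "id": [1, 2],
        "synopsis": ["a story about a dog", "a story about a cat"],
        "label": [0, 1],
    })

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def tokenize(batch):
        return tokenizer(batch["synopsis"], truncation=True)

    print(raw_ds.column_names)
    # ['id', 'synopsis', 'label']

    tokenized = raw_ds.map(tokenize, batched=True)
    print(tokenized.column_names)
    # ['id', 'synopsis', 'label', 'input_ids', 'token_type_ids', 'attention_mask']
    #  ^ the original columns are still there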
At training time the default data collator tries to batch every remaining column.
Maybe try something like this:

    tokenized = raw_ds.map(
        tokenize,
        batched=True,
        remove_columns=[c for c in raw_ds.column_names if c not in ("label",)],
    )
That leaves you with only the inputs (and the labels), so id, synopsis, etc. never reach the model.
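You can sanity-check what survives (token_type_ids only shows up for BERT-style tokenizers):

    print(tokenized.column_names)
    # ['label', 'input_ids', 'token_type_ids', 'attention_mask']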
The tokenizer inserts the model's [SEP] token between them, sets token_type_ids correctly, and truncates/pads both sides together.
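Concretely, something like this (text_a / text_b and the checkpoint are placeholders; swap in your own two columns):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any BERT-style checkpoint

    def tokenize(batch):
        # Passing two text arguments builds a single paired input:
        # [CLS] text_a [SEP] text_b [SEP], with token_type_ids 0 for the first
        # segment and 1 for the second; truncation applies to the pair as a whole.
        return tokenizer(
            batch["text_a"],        # placeholder column names
            batch["text_b"],
            truncation=True,
            padding="max_length",
            max_length=128,
        )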