Trainer not passing all features of the dataset?

Hello everyone,

I have a dataset (created from a generator) which, in addition to the inputs, labels, and attention mask, has an extra feature.

from datasets import Dataset

def gen(torch_dataset):
    # Yield examples one by one from the underlying PyTorch dataset
    for idx in range(len(torch_dataset)):
        yield torch_dataset[idx]

train_dataset = Dataset.from_generator(gen, gen_kwargs={"torch_dataset": torch_train_dataset})
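
As a quick sanity check (the feature name here is just illustrative), the extra column shows up right after creation:

print(train_dataset.column_names)
# e.g. ['input_ids', 'attention_mask', 'label', 'extra_argument']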

I am using the weighted trainer from the following tutorial:
https://github.com/huggingface/blog/blob/main/Lora-for-sequence-classification-with-Roberta-Llama-Mistral.md

where I added a line to pop the extra feature from the inputs:

extra_argument = inputs.pop("extra_argument")
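
For context, this is roughly where that line lives in my version of the tutorial's weighted trainer (the class weights and the feature name below are placeholders, not the exact values I use):

import torch
from transformers import Trainer

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Pop the extra feature so model(**inputs) does not receive an unexpected keyword
        extra_argument = inputs.pop("extra_argument")
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Weighted cross-entropy as in the tutorial; the weights here are placeholders
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.0], device=logits.device))
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss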

After creating the dataset, I check that it is loaded correctly and has the correct number of features. However, when I inspect the inputs inside the trainer, the extra feature is no longer there.

Is there any place where I am supposed to declare extra arguments for the trainer? Maybe it only keeps the hardcoded/typical columns.

Adding a custom collator does not seem to resolve it either:

from typing import Any, Dict, List

import torch
from transformers import DataCollatorWithPadding

class CustomDataCollatorWithPadding(DataCollatorWithPadding):
    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        # Extract the extra feature separately so padding does not choke on it
        extra_argument = [feature.pop('extra_argument') for feature in features]

        # Use the parent class's __call__ method to handle padding
        batch = super().__call__(features)

        # Add the extra feature back into the batch
        batch['extra_argument'] = torch.tensor(extra_argument)

        return batch
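
For reference, the collator is hooked up in the usual way; a rough sketch, assuming model, tokenizer, training_args, and the datasets are defined as earlier in the tutorial:

trainer = WeightedCELossTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=CustomDataCollatorWithPadding(tokenizer=tokenizer),
)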

As an update, the solution is to set the following in the TrainingArguments that are passed to the trainer:

remove_unused_columns=False
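
i.e. roughly (output_dir is just a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    remove_unused_columns=False,  # stop the Trainer from dropping columns that model.forward() does not accept
)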
