Trainer not passing all features of the dataset?

Hello everyone,

I have a dataset (created from a generator) which, in addition to the inputs, labels, and attention mask, has an extra feature.

from datasets import Dataset

def gen(torch_dataset):
    # Yield examples one by one from the underlying PyTorch dataset
    for idx in range(len(torch_dataset)):
        yield torch_dataset[idx]

train_dataset = Dataset.from_generator(gen, gen_kwargs={"torch_dataset": torch_train_dataset})
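
As a quick sanity check (the feature name here is just illustrative), the extra column shows up right after creation:

print(train_dataset.column_names)
# e.g. ['input_ids', 'attention_mask', 'label', 'extra_argument']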

I am using the weighted trainer from the following tutorial:
https://github.com/huggingface/blog/blob/main/Lora-for-sequence-classification-with-Roberta-Llama-Mistral.md

where I added a line to pop the extra feature from the inputs:

extra_argument = inputs.pop("extra_argument")
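
For context, this is roughly where that line lives in my version of the tutorial's weighted trainer (the class weights and the feature name below are placeholders, not the exact values I use):

import torch
from transformers import Trainer

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Pop the extra feature so model(**inputs) does not receive an unexpected keyword
        extra_argument = inputs.pop("extra_argument")
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Weighted cross-entropy as in the tutorial; the weights here are placeholders
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.0], device=logits.device))
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss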

After creating the dataset, I check that it is loaded correctly and has the correct number of features. However, when I inspect the inputs inside the trainer, the extra feature is no longer there.

Is there any place where I am supposed to declare extra arguments for the trainer? Maybe it only keeps the hardcoded/typical columns.

Adding a custom collator does not seem to resolve it either:

from typing import Any, Dict, List

import torch
from transformers import DataCollatorWithPadding

class CustomDataCollatorWithPadding(DataCollatorWithPadding):
    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        # Extract the extra feature separately so padding does not choke on it
        extra_argument = [feature.pop('extra_argument') for feature in features]

        # Use the parent class's __call__ method to handle padding
        batch = super().__call__(features)

        # Add the extra feature back into the batch
        batch['extra_argument'] = torch.tensor(extra_argument)

        return batch
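
For reference, the collator is hooked up in the usual way; a rough sketch, assuming model, tokenizer, training_args, and the datasets are defined as earlier in the tutorial:

trainer = WeightedCELossTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=CustomDataCollatorWithPadding(tokenizer=tokenizer),
)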

As an update, the solution is to set the following in the TrainingArguments that are passed to the trainer:

remove_unused_columns=False
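
i.e. roughly (output_dir is just a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    remove_unused_columns=False,  # stop the Trainer from dropping columns that model.forward() does not accept
)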
