RemoveColumnsCollator is removing all columns

RemoveColumnsCollator is removing all keys from the variable "features", no matter which model or dataset I choose. The input to the function transformers.trainer_utils._remove_columns is a tokenized example, i.e. a dictionary with "input_ids" and "labels" as keys. But the output is always empty, because the function removes every key whose name is not in the signature columns (which for me are always ['args', 'kwargs', 'label', 'label_ids']). This results in empty input to the padding function, which then throws an error.
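
To illustrate (with made-up values), the filtering amounts to:

    # Hypothetical feature dict as produced by the tokenizer:
    feature = {"input_ids": [101, 2023, 102], "labels": 1}
    # The signature columns I observe, regardless of model/dataset:
    signature_columns = ["args", "kwargs", "label", "label_ids"]
    filtered = {k: v for k, v in feature.items() if k in signature_columns}
    print(filtered)  # {} -- every key is dropped

Here is RemoveColumnsCollator.__call__ with the offending line commented out: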

    def __call__(self, features: List[dict]):
        #features = [self._remove_columns(feature) for feature in features]
        return self.data_collator(features)

I had to comment out this line, but I do not know the implications. Is it OK to do so? And is this the expected behavior of this function?

I faced the same problem. The corresponding error message I get is

ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['attention_mask']

Deep-diving into the library code of transformers and torch, I found the following solution.
The function being executed, which you commented out, is the following (taken from transformers/trainer_utils.py:686):

    def _remove_columns(self, feature: dict) -> dict:
        if not isinstance(feature, dict):
            return feature
        if not self.message_logged and self.logger and self.model_name:
            ignored_columns = list(set(feature.keys()) - set(self.signature_columns))
            if len(ignored_columns) > 0:
                dset_description = "" if self.description is None else f"in the {self.description} set"
                self.logger.info(
                    f"The following columns {dset_description} don't have a corresponding argument in "
                    f"`{self.model_name}.forward` and have been ignored: {', '.join(ignored_columns)}."
                    f" If {', '.join(ignored_columns)} are not expected by `{self.model_name}.forward`, "
                    " you can safely ignore this message."
                )
                self.message_logged = True
        return {k: v for k, v in feature.items() if k in self.signature_columns}
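
Where do these signature columns come from? The Trainer derives them by inspecting the model's forward signature, roughly like this (a simplified sketch of Trainer._set_signature_columns_if_needed; the real code also merges in the Trainer's label_names and deduplicates):

    import inspect

    def get_signature_columns(model):
        # Keep only the columns named in model.forward's signature,
        # plus the label columns the Trainer always preserves.
        signature = inspect.signature(model.forward)
        return list(signature.parameters.keys()) + ["label", "label_ids"]

Note that a forward defined as def forward(self, *args, **kwargs) produces exactly ['args', 'kwargs', 'label', 'label_ids'], i.e. the signature columns reported in the question above.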

As the log message states, we run into this problem because we are trying to train a model whose forward function does not use the corresponding parameter names. For me this happened because I am training a custom model with a custom forward function:

    def forward(self, tokens, attention_mask): ...

So the _remove_columns function removed all entries in the fetched datapoint that do not correspond to 'tokens' or 'attention_mask' (in particular input_ids), which raised the ValueError mentioned above. Changing the forward function to

    def forward(self, input_ids, attention_mask): ...

has solved this problem for me.

Another problem popped up using the method I suggested, which I want to quickly outline:
The dataset still has all other columns removed, so in the example I gave above, dataset entries only include the values for input_ids and attention_mask. The problem with this is that the loss function also only gets these values. Therefore, to gain access to the labels of your dataset, you would need to add an empty labels=None parameter to your model's forward function, which doesn't seem right. This might be a bug, or maybe there is another, more pythonic way of telling the Trainer not to remove columns from the dataset.
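
For what it's worth, TrainingArguments does expose a remove_unused_columns flag for this; a minimal sketch of using it (model and train_dataset are placeholders):

    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="out",
        remove_unused_columns=False,  # keep all dataset columns (the default is True)
    )
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)

With this set, no columns are filtered out, so your data collator and forward function have to handle (or ignore) the extra columns themselves.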

A look into the official implementation of BertForSequenceClassification clears this up. It indeed uses a lot of different parameters in the model's forward function:

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]: ...

The labels parameter is not used directly in the calculation of the model's logits but rather afterwards: if given, it is used to compute and return the loss.

Therefore, if you implement a custom model, you should list all columns you will need from your dataset as parameters in the model's forward function definition (which may include the labels column).
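
Putting this together, a minimal custom model following that convention could look like the following (a sketch with made-up dimensions; the parameter names input_ids, attention_mask and labels mirror the dataset columns so the Trainer keeps them, and labels is optional so the same forward also works for inference):

    from typing import Optional

    import torch
    from torch import nn

    class MyClassifier(nn.Module):
        # Hypothetical model: the forward parameters mirror the dataset columns.
        def __init__(self, vocab_size: int = 30522, hidden: int = 128, num_labels: int = 2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.classifier = nn.Linear(hidden, num_labels)
            self.num_labels = num_labels

        def forward(
            self,
            input_ids: torch.Tensor,
            attention_mask: Optional[torch.Tensor] = None,
            labels: Optional[torch.Tensor] = None,
        ):
            hidden = self.embed(input_ids)
            if attention_mask is not None:
                # Zero out padded positions before mean-pooling.
                hidden = hidden * attention_mask.unsqueeze(-1)
            logits = self.classifier(hidden.mean(dim=1))
            loss = None
            if labels is not None:
                # Like BertForSequenceClassification: labels are only used to
                # compute and return the loss, not the logits.
                loss = nn.CrossEntropyLoss()(logits.view(-1, self.num_labels), labels.view(-1))
            # The Trainer reads the loss from the "loss" key of the output dict.
            return {"loss": loss, "logits": logits} if loss is not None else {"logits": logits}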