RemoveColumnsCollator is removing all columns

RemoveColumnsCollator is removing all keys from the variable "features", no matter which model or dataset I choose. The input to the function transformers.trainer_utils._remove_columns is a tokenized example, i.e. a dictionary with "input_ids" and "labels" as keys. But the output is always empty, because the function removes every key whose name is not in the signature columns (which for me are always ['args', 'kwargs', 'label', 'label_ids']). This results in empty input to the padding function, which then throws an error.
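
To illustrate (with made-up values), the filtering amounts to:

    # Hypothetical feature dict as produced by the tokenizer:
    feature = {"input_ids": [101, 2023, 102], "labels": 1}
    # The signature columns I observe, regardless of model/dataset:
    signature_columns = ["args", "kwargs", "label", "label_ids"]
    filtered = {k: v for k, v in feature.items() if k in signature_columns}
    print(filtered)  # {} -- every key is dropped

Here is RemoveColumnsCollator.__call__ with the offending line commented out: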

    def __call__(self, features: List[dict]):
        #features = [self._remove_columns(feature) for feature in features]
        return self.data_collator(features)

I had to comment out this line, but I do not know the implications. Is it OK to do so? And is this the expected behavior of this function?

I faced the same problem. The corresponding error message I get is

ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['attention_mask']

Deep-diving into the library code of transformers and torch, I found the following solution.
The function being executed, which you commented out, is the following (taken from transformers/trainer_utils.py:686):

    def _remove_columns(self, feature: dict) -> dict:
        if not isinstance(feature, dict):
            return feature
        if not self.message_logged and self.logger and self.model_name:
            ignored_columns = list(set(feature.keys()) - set(self.signature_columns))
            if len(ignored_columns) > 0:
                dset_description = "" if self.description is None else f"in the {self.description} set"
                self.logger.info(
                    f"The following columns {dset_description} don't have a corresponding argument in "
                    f"`{self.model_name}.forward` and have been ignored: {', '.join(ignored_columns)}."
                    f" If {', '.join(ignored_columns)} are not expected by `{self.model_name}.forward`, "
                    " you can safely ignore this message."
                )
                self.message_logged = True
        return {k: v for k, v in feature.items() if k in self.signature_columns}
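
Where do these signature columns come from? The Trainer derives them by inspecting the model's forward signature, roughly like this (a simplified sketch of Trainer._set_signature_columns_if_needed; the real code also merges in the Trainer's label_names and deduplicates):

    import inspect

    def get_signature_columns(model):
        # Keep only the columns named in model.forward's signature,
        # plus the label columns the Trainer always preserves.
        signature = inspect.signature(model.forward)
        return list(signature.parameters.keys()) + ["label", "label_ids"]

Note that a forward defined as def forward(self, *args, **kwargs) produces exactly ['args', 'kwargs', 'label', 'label_ids'], i.e. the signature columns reported in the question above.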

As the log message states, we run into this problem because we are trying to train a model whose forward function does not use the corresponding parameter names. For me this happened because I am training a custom model with a custom forward function:

    def forward(self, tokens, attention_mask): ...

So the _remove_columns function removed all entries in the fetched datapoint that do not correspond to 'tokens' or 'attention_mask' (in particular input_ids), which raised the ValueError mentioned above. Changing the forward function to

    def forward(self, input_ids, attention_mask): ...

has solved this problem for me.

Another problem popped up using the method I suggested, which I want to quickly outline:
The dataset still has all other columns removed, so in the example I gave above, dataset entries only include the values for input_ids and attention_mask. The problem with this is that the loss function also only gets these values. Therefore, to gain access to the labels of your dataset, you would need to add an empty labels=None parameter to your model's forward function, which doesn't seem right. This might be a bug, or maybe there is another, more pythonic way of telling the Trainer not to remove columns from the dataset.
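
For what it's worth, TrainingArguments does expose a remove_unused_columns flag for this; a minimal sketch of using it (model and train_dataset are placeholders):

    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="out",
        remove_unused_columns=False,  # keep all dataset columns (the default is True)
    )
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)

With this set, no columns are filtered out, so your data collator and forward function have to handle (or ignore) the extra columns themselves.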

A look into the official implementation of BertForSequenceClassification clears this up. It indeed uses a lot of different parameters in the model's forward function:

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]: ...

The labels parameter is not used directly in the calculation of the model's logits but rather afterwards: if given, it is used to compute and return the loss.

Therefore, if you implement a custom model, you should list all columns you will need from your dataset as parameters in the model's forward function definition (which may include the labels column).
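
Putting this together, a minimal custom model following that convention could look like the following (a sketch with made-up dimensions; the parameter names input_ids, attention_mask and labels mirror the dataset columns so the Trainer keeps them, and labels is optional so the same forward also works for inference):

    from typing import Optional

    import torch
    from torch import nn

    class MyClassifier(nn.Module):
        # Hypothetical model: the forward parameters mirror the dataset columns.
        def __init__(self, vocab_size: int = 30522, hidden: int = 128, num_labels: int = 2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.classifier = nn.Linear(hidden, num_labels)
            self.num_labels = num_labels

        def forward(
            self,
            input_ids: torch.Tensor,
            attention_mask: Optional[torch.Tensor] = None,
            labels: Optional[torch.Tensor] = None,
        ):
            hidden = self.embed(input_ids)
            if attention_mask is not None:
                # Zero out padded positions before mean-pooling.
                hidden = hidden * attention_mask.unsqueeze(-1)
            logits = self.classifier(hidden.mean(dim=1))
            loss = None
            if labels is not None:
                # Like BertForSequenceClassification: labels are only used to
                # compute and return the loss, not the logits.
                loss = nn.CrossEntropyLoss()(logits.view(-1, self.num_labels), labels.view(-1))
            # The Trainer reads the loss from the "loss" key of the output dict.
            return {"loss": loss, "logits": logits} if loss is not None else {"logits": logits}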