Non-label features are not passed into data collator

I am trying to implement a custom data collator that tokenizes batches on the fly. But somehow, the features passed to it do not have any field/key other than label. Hence, in my code below, when trying to extract the text for tokenization, I get the error KeyError: 'text1'.

I read the source code of the collators in transformers, but I don’t see where non-label fields are dropped. Can someone help me?

Below is my code, which uses a hardcoded dataset with two keys/fields, namely text1 and label:


from typing import List, Dict, Tuple, Literal, Any
import torch 
import datasets

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding

train_ds = datasets.Dataset.from_dict(
                            {"text1":["A", "B", "C", "D", "E"], 
                             "label":[0, 0, 0, 1, 1]
                      })

class SmartCollator(DataCollatorWithPadding):
    """Tokenize each batch on the fly"""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    
    def __call__(self, features: List[Dict[str, Any]]):
        texts = [f["text1"] for f in features]  # KeyError: 'text1' is raised here
        labels = [f["label"] for f in features]

        encodings = self.tokenizer(texts, truncation=True, padding="longest", max_length=20, return_tensors="pt")

        return encodings.update({
            'labels': torch.tensor(labels)
        })

training_args = TrainingArguments(
    output_dir='./results',          
    num_train_epochs=1,              
    per_device_train_batch_size=2,
    report_to="none"
)

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model, 
    args=training_args,
    data_collator=SmartCollator(tokenizer=tokenizer),
    train_dataset=train_ds
)

trainer.train()

There is probably a problem with the format of the dataset. I think it will work as expected if you build it from a list of dictionaries with Dataset.from_list().
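
For reference, a minimal sketch of that construction with the same toy data (a list of per-example dicts instead of a dict of columns):

import datasets

# Each element is one example carrying all of its fields
train_ds = datasets.Dataset.from_list(
    [{"text1": "A", "label": 0},
     {"text1": "B", "label": 0},
     {"text1": "C", "label": 0},
     {"text1": "D", "label": 1},
     {"text1": "E", "label": 1}]
)
print(train_ds.column_names)  # ['text1', 'label']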

Thanks @John6666. I tried from_list(). Unfortunately, it does not work either.


Oh… Perhaps this? The Trainer drops dataset columns that the model’s forward() does not accept before they reach the collator, so try setting remove_unused_columns=False in TrainingArguments.

It may still error, but it should be an improvement.

from typing import List, Dict, Tuple, Literal, Any
import torch 
import datasets

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding

#train_ds = datasets.Dataset.from_dict(
#                            {"text1":["A", "B", "C", "D", "E"], 
#                             "label":[0, 0, 0, 1, 1]
#                      })

train_ds = datasets.Dataset.from_list([{"text1": "A", "label": 0}, {"text1": "B", "label": 0}])

class SmartCollator(DataCollatorWithPadding):
    """Tokenize each batch on the fly"""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    
    def __call__(self, features: List[Dict[str, Any]]):
        print(features) # for debugging
        texts = [f["text1"] for f in features]
        labels = [f["label"] for f in features]

        encodings = self.tokenizer(texts, truncation=True, padding="longest", max_length=20, return_tensors="pt")

        return encodings.update({
            'labels': torch.tensor(labels)
        })

training_args = TrainingArguments(
    output_dir='./results',          
    num_train_epochs=1,              
    per_device_train_batch_size=2,
    report_to="none",
    remove_unused_columns=False,
)

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=SmartCollator(tokenizer=tokenizer),
    train_dataset=train_ds
)

trainer.train()
The debugging print now shows both keys reaching the collator:

[{'text1': 'A', 'label': 0}, {'text1': 'B', 'label': 0}]

I think I’ve solved it. dict.update() mutates the dictionary in place and returns None, so its result cannot be returned from __call__.
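
A quick plain-Python illustration of that pitfall:

# dict.update() modifies the dict in place and returns None
encodings = {"input_ids": [[101, 102]]}
result = encodings.update({"labels": [0]})
print(result)     # None — this is what the collator was returning to the Trainer
print(encodings)  # {'input_ids': [[101, 102]], 'labels': [0]} — the dict itself was updated

And the working version of the full script: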

from typing import List, Dict, Tuple, Literal, Any
import torch 
import datasets

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding

#train_ds = datasets.Dataset.from_dict(
#                            {"text1":["A", "B", "C", "D", "E"], 
#                             "label":[0, 0, 0, 1, 1]
#                      })

train_ds = datasets.Dataset.from_list([{"text1": "A", "label": 0}, {"text1": "B", "label": 0}])

class SmartCollator(DataCollatorWithPadding):
    """Tokenize each batch on the fly"""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    
    def __call__(self, features: List[Dict[str, Any]]):
        texts = [f["text1"] for f in features]
        labels = [f["label"] for f in features]

        encodings = self.tokenizer(texts, truncation=True, padding="longest", max_length=20, return_tensors="pt")
        encodings.update(labels=torch.tensor(labels))

        return encodings

training_args = TrainingArguments(
    output_dir='./results',          
    num_train_epochs=1,              
    per_device_train_batch_size=2,
    report_to="none",
    remove_unused_columns=False,
)

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=SmartCollator(tokenizer=tokenizer),
    train_dataset=train_ds
)

trainer.train()

You are right. I forgot that dict.update() does not return the updated dict. Thanks so much.

So to summarize:
Besides fixing the None return value of dict.update() in Python, the solution is to set remove_unused_columns=False in TrainingArguments, so that columns unknown to the model’s forward() method (which, I guess, accepts only input_ids, attention_mask, and labels for many models) are preserved rather than removed by default before they reach the collator.
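
A condensed sketch of the two fixes together (the collate function name here is only illustrative; the working version is the SmartCollator above):

import torch
from transformers import TrainingArguments

# 1) Keep raw columns such as "text1": by default the Trainer removes columns
#    that the model's forward() signature does not accept, before collation.
training_args = TrainingArguments(
    output_dir="./results",
    report_to="none",
    remove_unused_columns=False,
)

# 2) Update the encodings and return them; do not return the result of
#    update(), which is always None.
def collate(features, tokenizer):
    encodings = tokenizer(
        [f["text1"] for f in features],
        truncation=True, padding="longest", max_length=20, return_tensors="pt",
    )
    encodings.update(labels=torch.tensor([f["label"] for f in features]))
    return encodings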

