"following columns in the training set don't have a corresponding argument"

I’m getting this error message in my command line output when trying to train the model from this tutorial:

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.

Here is my full code:

    import datasets
    import numpy as np
    from transformers import (
        DistilBertForSequenceClassification,
        DistilBertTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    num_epochs = 2
    batch_size = 16
    learning_rate = 2e-5

    train_dataset = datasets.load_dataset('rotten_tomatoes', split='train')
    val_dataset = datasets.load_dataset('rotten_tomatoes', split='validation')
    test_dataset = datasets.load_dataset('rotten_tomatoes', split='test')


    # load in model 
    model = DistilBertForSequenceClassification.from_pretrained(
        'distilbert-base-uncased', 
        num_labels=2
    ).cuda()

    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation=True)

    tokenized_train = train_dataset.map(preprocess_function, batched=True)
    tokenized_valid = val_dataset.map(preprocess_function, batched=True)
    tokenized_test = test_dataset.map(preprocess_function, batched=True)

    tokenized_train.set_format(type="torch", columns=["input_ids", "text", "attention_mask", "label"])
    print('dataset format: ', tokenized_train.format['type'])
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors='pt')

    def compute_metrics(eval_pred):
        # simple accuracy metric passed to the Trainer below
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=1)
        return {"accuracy": float((preds == labels).mean())}

    # train 
    training_args = TrainingArguments(
        output_dir='output_dir/',
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_epochs,
        weight_decay=0.01,
        save_strategy="no",
        push_to_hub=False,
        evaluation_strategy='epoch',
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_test,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()
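
For reference, if I print the tokenized dataset's columns before training, the raw `text` column is still there alongside the token IDs (this is just my own sanity check, not part of the tutorial):

    print(tokenized_train.column_names)
    # contains 'text' as well as 'input_ids', 'attention_mask', and 'label'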

Is this because the model is only using `input_ids` and `attention_mask`, and not the `text` at all? I understand that the purpose of tokenization is to convert the text into a numerical format the model can read, but I'm not sure how to check what the model is actually receiving as training data, or how to confirm that it isn't the `text` column of the dataset object.
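
My best guess for checking this would be to pull one batch out of the Trainer's dataloader and look at its keys, something like the sketch below, though I'm not sure whether this is the right way to inspect it:

    batch = next(iter(trainer.get_train_dataloader()))
    print(batch.keys())  # I'd expect only input_ids, attention_mask, and labels here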

A similar question has been answered here: #23624