I’m getting this error message in my command line output when trying to train the model from this tutorial:
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
This is my full code:
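import datasets
from transformers import (
    DataCollatorWithPadding,
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    Trainer,
    TrainingArguments,
)

# hyperparameters
num_epochs = 2
batch_size = 16
learning_rate = 2e-5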
train_dataset = datasets.load_dataset('rotten_tomatoes', split='train')
val_dataset = datasets.load_dataset('rotten_tomatoes', split='validation')
test_dataset = datasets.load_dataset('rotten_tomatoes', split='test')
# load in model
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2
).cuda()
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_valid = val_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)
tokenized_train.set_format(type="torch", columns=["input_ids", "text", "attention_mask", "label"])
print('dataset format: ', tokenized_train.format['type'])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors='pt')
# train
training_args = TrainingArguments(
    output_dir='output_dir/',
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    save_strategy="no",
    push_to_hub=False,
    evaluation_strategy='epoch',
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
Is this because the model is using the `input_ids` and `attention_mask` columns and is not using the text at all? I understand that the purpose of tokenization is to convert the text into a numerical format the model can read, but I’m not sure how to check which columns are actually used as training data, or how to confirm that the `text` column in the dataset object is not one of them.
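
To make the question concrete, here is a minimal sketch of the check I think is happening. My assumption is that `Trainer` compares the dataset's column names with the parameter names of the model's `forward` method and drops whatever does not match, while keeping the label columns. The `unused_columns` helper and the `fake_forward` stand-in are hypothetical names I made up for illustration. They are not part of `transformers`, and `fake_forward` lists only a few of the arguments I would expect on the real method.

```python
import inspect


def unused_columns(forward_fn, column_names, extra_kept=("label", "label_ids")):
    """Return the columns that match no parameter of forward_fn.

    extra_kept is an assumption: label columns that are kept even though
    forward() names its argument `labels`.
    """
    accepted = set(inspect.signature(forward_fn).parameters) | set(extra_kept)
    return [name for name in column_names if name not in accepted]


# Hypothetical stand-in for DistilBertForSequenceClassification.forward,
# so the sketch runs without downloading a model.
def fake_forward(input_ids=None, attention_mask=None, labels=None):
    pass


# Column names as they would look after the tokenizer map() step above.
columns = ["text", "label", "input_ids", "attention_mask"]
ignored = unused_columns(fake_forward, columns)
print(ignored)  # ['text']
```

With the real objects from my code, the call would presumably be `unused_columns(model.forward, tokenized_train.column_names)`. If that also returns only `text`, it would suggest the message just means the raw strings are dropped before the batch reaches the model.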