Good day everyone!
I am following a blog post to fine-tune a topic classification model and got stuck here. I've read the relevant discussions and tried the listed solutions, but my problem is still unresolved.
Here is my preprocess_function:
def preprocess_function(data):
    # Tokenize the input texts, padding to the longest sequence in the batch
    model_inputs = tokenizer(
        data["text"],
        padding="longest",
        truncation=True,
    )
    # Invert id2label (index -> label) into a label -> index map
    label_map = {label: index for index, label in id2label.items()}
    labels = [label_map[l] for l in data["label"]]
    model_inputs["labels"] = labels
    return model_inputs
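For context, id2label, label2id, and the tokenized splits used later are built roughly like this (a sketch of the usual pattern; my exact code may differ slightly):

# assumed setup: build the label maps from the dataset's label column
unique_labels = sorted(set(train_dataset["label"]))
id2label = {index: label for index, label in enumerate(unique_labels)}
label2id = {label: index for index, label in enumerate(unique_labels)}

# assumed preprocessing: map the function over each split in batches
# (possibly with num_proc=NUM_PROCS)
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)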
The result of preprocess_function() on test_dataset is okay, containing the desired fields and corresponding values. However, when I apply the function to a single sample, I get an error. For instance, with train_dataset[0] being {'text': 'time in germany', 'label': 'Clock'}:
tokenized_sample = preprocess_function(train_dataset[0])
print(tokenized_sample)
Then I got the following KeyError:
KeyError Traceback (most recent call last)
<ipython-input-110-860026a66a82> in <cell line: 1>()
----> 1 tokenized_sample = preprocess_function(train_dataset[0])
2 print(tokenized_sample)
3 print(f"Length of tokenized IDs: {len(tokenized_sample.input_ids)}")
4 print(f"Length of attention mask: {len(tokenized_sample.attention_mask)}")
1 frames
<ipython-input-105-2b50729ea506> in <listcomp>(.0)
7
8 label_map = {label: index for index, label in id2label.items()}
----> 9 labels = [label_map[l] for l in examples["label"]]
10 # labels = tokenizer(labels, padding="longest", truncation=True)
11 model_inputs["labels"] = labels
KeyError: 'C'
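If I inspect the types directly, the batched and single-sample cases look different (a quick check; I am assuming this is the standard datasets indexing behavior):

sample = train_dataset[0]     # integer indexing returns one row as a plain dict
print(type(sample["label"]))  # <class 'str'> -> iterating yields 'C', 'l', 'o', ...

batch = train_dataset[:2]     # slicing returns a dict of lists (one list per column)
print(type(batch["label"]))   # <class 'list'> -> iterating yields whole label strings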
So I assume there is something wrong with the data type, but I am not sure how to fix it. Because of this error, the training code below fails as well:
BATCH_SIZE = 8
NUM_PROCS = 4

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)

# Training arguments
OUT_DIR = "output"
LR = 1e-5
EPOCHS = 3
training_args = TrainingArguments(
    output_dir=OUT_DIR,
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=3,
    report_to="tensorboard",
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=None,  # No need for data collator as padding is handled in preprocess_function
    compute_metrics=None,
)
trainer.train()
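For what it's worth, wrapping the single sample's fields in length-1 lists seems to avoid the KeyError, but I am not sure this is the right fix (this is just my own experiment, not something from the blog):

# hypothetical workaround: wrap each scalar field in a length-1 list
single = {key: [value] for key, value in train_dataset[0].items()}
tokenized_sample = preprocess_function(single)
print(tokenized_sample)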
Many thanks for your help!