Hello all!
I am working on classifying power plant outage reports into three target severity classes. I am new to NLP, but from what I have read, this seems like a straightforward classification task that BERT can help with.
Initially I was able to fine-tune BERT and get predictions on a test set by simply defining my model as follows:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
However, after some reading it was not clear to me which loss function was being used with these settings, so I tried defining the model as follows instead:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3, problem_type="multi_label_classification")
This results in the following ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-33-ddb64c2d7f44> in <module>()
----> 1 output = trainer.train()
2 print(output)
8 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
3128
3129 if not (target.size() == input.size()):
-> 3130 raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
3131
3132 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
ValueError: Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 3]))
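For reference, the same shape mismatch can be reproduced outside the Trainer with plain PyTorch. This is only a minimal sketch: the random logits and labels below are stand-ins I made up to match the shapes in the traceback (a batch of 8 and 3 classes), not my actual data.

```python
import torch
import torch.nn.functional as F

# Stand-ins matching the shapes in the traceback (assumed, not real data).
logits = torch.randn(8, 3)            # model output: batch of 8, 3 classes
labels = torch.randint(0, 3, (8,))    # integer class ids, shape [8]

# Cross-entropy accepts integer class ids of shape [batch].
ce_loss = F.cross_entropy(logits, labels)

# BCE-with-logits requires a target with the same shape as the logits,
# so passing shape [8] against shape [8, 3] raises the ValueError above.
try:
    F.binary_cross_entropy_with_logits(logits, labels.float())
    bce_error = None
except ValueError as err:
    bce_error = str(err)

print(ce_loss.item())
print(bce_error)
```

So the error seems to come from the loss expecting one target value per class per example, while my labels column holds a single class id per example.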
I have found a potential solution involving .unsqueeze, shown in this Stack Overflow question: "conv neural network - ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 1]))".
However, before I go implementing this I have some questions that I hope to get feedback on:
- What is the default loss function used when I am not using binary_cross_entropy_with_logits, as set with problem_type?
- Are there repercussions to using unsqueeze? (See the PyTorch 1.11.0 documentation for torch.unsqueeze.)
- How do I implement .unsqueeze? I see from the documentation that I need to pass in my input tensor, but at what stage in the Trainer API is my tensor being created?
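To make the unsqueeze question concrete, here is a quick shape check. It is a sketch with a made-up batch of 8 class ids; the one-hot line is just one assumed way of getting a target with the same shape as the logits, not something I have confirmed is the right approach for my task.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of 8 integer class ids (3 classes: 0, 1, 2).
labels = torch.tensor([0, 2, 1, 1, 0, 2, 2, 1])

# unsqueeze only inserts a dimension of size 1.
unsqueezed = labels.unsqueeze(1)

# One-hot encoding produces one column per class.
one_hot = F.one_hot(labels, num_classes=3).float()

print(unsqueezed.shape)  # torch.Size([8, 1])
print(one_hot.shape)     # torch.Size([8, 3])
```

If I read this correctly, unsqueeze alone would give a target of size [8, 1], which still would not match the [8, 3] input in my error, whereas the linked Stack Overflow case had a single output column.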
Below are the transformers version I am using through Google Colab and the code leading up to the error.
- `transformers` version: 4.20.0.dev0
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.6.0
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): 2.8.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
CODE
! pip install git+https://github.com/huggingface/transformers.git
! pip install transformers datasets
! pip install pandas
! pip install comet_ml
! pip install comet_ml --upgrade
! pip install sklearn
! pip install transformers
! pip install --user urllib3==1.25.10
! pip install folium==0.2.1
import comet_ml
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset, dataset_dict, DatasetDict, Dataset, load_metric
from sklearn.model_selection import train_test_split
from pprint import pprint as pp
from sklearn.metrics import precision_recall_curve, roc_curve
comet_ml.init()
experiment = comet_ml.Experiment(
    project_name="confusion-matrix",
)
N_CLASSES = 3
N_EPOCHS = 24
def compute_metrics(eval_pred):
    experiment = comet_ml.get_global_experiment()
    metric0 = load_metric("accuracy")
    metric1 = load_metric("precision")
    metric2 = load_metric("recall")
    metric3 = load_metric("f1")
    logits, labels = eval_pred
    print(logits)
    predictions = np.argmax(logits, axis=-1)
    accuracy = metric0.compute(predictions=predictions, references=labels)["accuracy"]
    precision = metric1.compute(predictions=predictions, references=labels, average="macro")["precision"]
    recall = metric2.compute(predictions=predictions, references=labels, average="macro")["recall"]
    f1 = metric3.compute(predictions=predictions, references=labels, average="macro")["f1"]
    experiment.log_confusion_matrix(predictions, labels)
    experiment.log_metric("accuracy", accuracy, epoch=N_EPOCHS)
    experiment.log_metric("precision", precision, epoch=N_EPOCHS)
    experiment.log_metric("recall", recall, epoch=N_EPOCHS)
    experiment.log_metric("f1", f1, epoch=N_EPOCHS)
    print(predictions, labels)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
# Reading in of data
df = pd.read_csv('noRefuelingOutagesUnderSampled.csv',usecols=['text','target'])
df = df.dropna()
df = df.rename(columns={'target':'labels'})
df['labels'] = df['labels']
dataset = Dataset.from_pandas(df)
# Tokenize
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Splitting into train, test, validation
train_test = dataset.train_test_split(test_size=0.30) # split dataset
test_valid = train_test['test'].train_test_split(test_size=0.70) # validation
mainDataset = DatasetDict({
    'train': train_test['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})
print(train_test)
# Wrapper
def tokenize_function(example):
    return tokenizer(example['text'], max_length=256, padding="max_length", truncation=True, add_special_tokens=True)
tokenized_datasets = mainDataset.map(tokenize_function,batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['text','__index_level_0__'])
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
# Define Model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3,problem_type="multi_label_classification")
training_args = TrainingArguments("test_trainer",
    evaluation_strategy="epoch",    # Evaluation is done every epoch
    num_train_epochs=N_EPOCHS,      # Number of epochs
    per_device_train_batch_size=8,  # Training batch size per GPU
    per_device_eval_batch_size=32,  # Evaluation batch size per GPU
    logging_dir="bert_results/logs",
    logging_steps=10,
)
trainer = Trainer(model=model,
    args=training_args,
    train_dataset=full_train_dataset,
    eval_dataset=full_eval_dataset,
    compute_metrics=compute_metrics,
)
output = trainer.train()
I tried to be as clear as possible in this post, but I am still a rookie, so if I failed to provide some important info, please let me know!
Thanks again everyone!