I want to compare the performance of different BERT models when fine-tuning on my corpus of tweets. Here is part of the code I am using:
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased", padding='max_length', truncation=True)
model = TFAutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2)
model2 = TFAutoModelForSequenceClassification.from_pretrained(
    'vinai/bertweet-base', num_labels=2)
opt = Adam(learning_rate=lr_scheduler)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
#loss = tf.keras.losses.CategoricalCrossentropy()
# Fine-tune BERTweet first
with tf.device('/gpu:0'):
    model2.compile(optimizer=opt, loss=loss, metrics=[my_metric])
    click.secho('Fine-tuning the BERTweet model', fg='yellow', bold=True)
    model2.fit(
        tf_train_dataset,
        validation_data=tf_test_dataset,
        epochs=num_epochs,
        class_weight=class_weights
    )
# Then fine-tune BERT base, reusing the same optimizer and loss objects
with tf.device('/gpu:0'):
    model.compile(optimizer=opt, loss=loss, metrics=[my_metric])
    click.secho('Fine-tuning the BERTbase model', fg='yellow', bold=True)
    model.fit(
        tf_train_dataset,
        validation_data=tf_test_dataset,
        epochs=num_epochs+1,
        class_weight=class_weights
    )
Then I plot the ROC and PR curves as follows:
preds = model.predict(tf_test_dataset)
predictions = tf.math.softmax(preds.logits, axis=-1)
predictions2=np.array(predictions)[:,1]
lr_probs=predictions2
# %% EVALUATION PERFORMANCE tweetbert:
preds = model2.predict(tf_test_dataset)
predictions = tf.math.softmax(preds.logits, axis=-1)
predictions2=np.array(predictions)[:,1]
lr_probs_2=predictions2
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
figure(figsize=(8, 6), dpi=80)
# ROC curve and AUC for BERT base
fpr, tpr, thresh = roc_curve(test["classify"], lr_probs)
auc = roc_auc_score(test["classify"], lr_probs)
plt.plot(fpr,tpr,label="BERT, auc="+str(round(auc,3)))
# ROC curve and AUC for BERTweet
fpr, tpr, thresh = roc_curve(test["classify"], lr_probs_2)
auc = roc_auc_score(test["classify"], lr_probs_2)
plt.plot(fpr,tpr,label="BERTweet, auc="+str(round(auc,3)))
What happens is that I only get sensible performance from whichever model I fine-tune first (regardless of whether that is bert-base or BERTweet). The second one always gives a nonsensical AUC of around 30. The problem does not come from the prediction or plotting code, because the accuracy that TensorFlow reports during fine-tuning is already very low…
Also, I believe that compiling each model right before calling fit is the way to go. Is there something I am missing?
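One thing I have not been able to rule out is that both compile calls receive the same opt object. If lr_scheduler is a decaying schedule (an assumption, since its definition is not shown above), the step counter of that optimizer would already be at the end of the schedule when the second model starts training. The toy sketch below uses no TensorFlow at all: linear_decay, ToyOptimizer and fit are hypothetical names made up for illustration, and the 1000 steps and 5e-5 learning rate are placeholder values, not my real settings.

```python
# Toy stand-in for an optimizer driven by a decaying learning-rate schedule.
# Nothing here is the real TensorFlow API; all names are hypothetical.

def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=1000):
    """Learning rate that falls linearly from initial_lr to end_lr."""
    step = min(step, decay_steps)
    remaining = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * remaining + end_lr


class ToyOptimizer:
    """Tracks only the step counter that drives the schedule."""

    def __init__(self, schedule):
        self.schedule = schedule
        self.iterations = 0

    def current_lr(self):
        return self.schedule(self.iterations)

    def step(self):
        self.iterations += 1


def fit(optimizer, num_steps):
    """Pretend to train; return the first and last learning rate used."""
    first_lr = optimizer.current_lr()
    last_lr = first_lr
    for _ in range(num_steps):
        last_lr = optimizer.current_lr()
        optimizer.step()
    return first_lr, last_lr


# Case 1: one optimizer shared by both models, as in my code.
shared = ToyOptimizer(linear_decay)
first_model_lrs = fit(shared, 1000)
second_model_lrs = fit(shared, 1000)  # counter continues from 1000

# Case 2: a new optimizer (and so a new step counter) for a model.
fresh_lrs = fit(ToyOptimizer(linear_decay), 1000)

print("shared, first model :", first_model_lrs)
print("shared, second model:", second_model_lrs)
print("fresh optimizer     :", fresh_lrs)
```

In this toy version the second model trained with the shared optimizer never sees a non-zero learning rate, while a fresh optimizer starts from the initial rate again. If the same thing happens in my real code, creating a separate optimizer and schedule for each model before compile might be the fix, but I have not verified this.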