Evaluation performance issues with consecutive training of BERT models

I want to compare the performance of different BERT models when fine-tuning them on my corpus of tweets. Here is part of the code I am using for that:

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
import numpy as np
import click

# lr_scheduler, my_metric, class_weights, num_epochs and the
# tf_train_dataset / tf_test_dataset objects are defined earlier (not shown here).
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased", padding='max_length', truncation=True)
model = TFAutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2)
model2 = TFAutoModelForSequenceClassification.from_pretrained(
    'vinai/bertweet-base', num_labels=2)

opt = Adam(learning_rate=lr_scheduler)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# loss = tf.keras.losses.CategoricalCrossentropy()

with tf.device('/gpu:0'):
    model2.compile(optimizer=opt, loss=loss, metrics=[my_metric])
    click.secho('Fine-tuning the BERTweet model', fg='yellow', bold=True)
    model2.fit(
        tf_train_dataset,
        validation_data=tf_test_dataset,
        epochs=num_epochs,
        class_weight=class_weights
    )

with tf.device('/gpu:0'):
    model.compile(optimizer=opt, loss=loss, metrics=[my_metric])
    click.secho('Fine-tuning the BERT-base model', fg='yellow', bold=True)
    model.fit(
        tf_train_dataset,
        validation_data=tf_test_dataset,
        epochs=num_epochs + 1,
        class_weight=class_weights
    )
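
For context, tf_train_dataset and tf_test_dataset are tf.data.Dataset objects built from the tokenized tweets, roughly along these lines (a simplified sketch, not my exact preprocessing; the DataFrame names, the text column and max_length are placeholders, only the "classify" label column matches the real data):

def make_tf_dataset(df, tokenizer, batch_size=16):
    # Tokenize the raw tweets into fixed-length input_ids / attention_mask tensors
    enc = tokenizer(
        df["text"].tolist(),
        padding='max_length',
        truncation=True,
        max_length=128,
        return_tensors='tf'
    )
    labels = df["classify"].values
    # Pair the encoding dict with the integer labels and batch
    return tf.data.Dataset.from_tensor_slices((dict(enc), labels)).batch(batch_size)

tf_train_dataset = make_tf_dataset(train, tokenizer)
tf_test_dataset = make_tf_dataset(test, tokenizer)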

Then I plot the ROC and PR curves as follows:

# %% EVALUATION PERFORMANCE bert-base:
preds = model.predict(tf_test_dataset)                 # BERT-base logits
predictions = tf.math.softmax(preds.logits, axis=-1)   # logits -> class probabilities
lr_probs = np.array(predictions)[:, 1]                 # probability of the positive class

# %% EVALUATION PERFORMANCE bertweet:
preds = model2.predict(tf_test_dataset)                # BERTweet logits
predictions = tf.math.softmax(preds.logits, axis=-1)
lr_probs_2 = np.array(predictions)[:, 1]


from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

figure(figsize=(8, 6), dpi=80)

# ROC curve for BERT-base
fpr, tpr, thresh = roc_curve(test["classify"], lr_probs)
auc_bert = roc_auc_score(test["classify"], lr_probs)
plt.plot(fpr, tpr, label="BERT, auc=" + str(round(auc_bert, 3)))

# ROC curve for BERTweet
fpr, tpr, thresh = roc_curve(test["classify"], lr_probs_2)
auc_bertweet = roc_auc_score(test["classify"], lr_probs_2)
plt.plot(fpr, tpr, label="BERTweet, auc=" + str(round(auc_bertweet, 3)))

plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()

What happens is that I only obtain sensible performance from the model I fine-tune first (regardless of whether that is bert-base or BERTweet). The second one always gives a nonsensical AUC of around 0.30. The problem does not come from the prediction or plotting step: already during fine-tuning, the accuracy reported by TensorFlow for the second model is very low.

Also, I believe that compiling each model right before calling fit should be the way to go, shouldn't it? Is there something I am missing?
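
To make that last question concrete, the variant I am also wondering about would give each model its own freshly created optimizer right before its compile/fit, instead of the single shared opt above (illustrative sketch, opt_a/opt_b are just placeholder names; I have not tested whether this changes anything):

opt_a = Adam(learning_rate=lr_scheduler)
model2.compile(optimizer=opt_a, loss=loss, metrics=[my_metric])
model2.fit(tf_train_dataset, validation_data=tf_test_dataset,
           epochs=num_epochs, class_weight=class_weights)

opt_b = Adam(learning_rate=lr_scheduler)   # new optimizer instance, not reused from model2
model.compile(optimizer=opt_b, loss=loss, metrics=[my_metric])
model.fit(tf_train_dataset, validation_data=tf_test_dataset,
          epochs=num_epochs, class_weight=class_weights)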