Why am I getting constant predictions (but not when I use the older version from the Google Colab fake news notebook)?

I am trying to predict fake news. Originally I tried to reproduce the example from this notebook, but because it is older, some of its classes appear to be legacy. I then tried again using a newer notebook as a guide (IMDb Classification with Trainer.ipynb). The first example gave me varied predictions, but the predictions from the second example were all the same. For clarity, I put my adapted code from the first example in a GitHub repository here. Here is the code from my second example, which is also pasted below. The repository also contains the train and test data I used for both examples, in case you would like to reproduce the problem.
Below is the code from the second example, followed by all of its output. At the very end I also include the output from the first example, which shows different predictions.

Thank you for reading this post.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig, EvalPrediction, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from nlp import Dataset, load_dataset
from torch import tensor
import pandas as pd
import numpy as np
import torch

# read csv in pandas
df_train = pd.read_csv("~/Downloads/fakenewstrain.csv")
df_test = pd.read_csv("~/Downloads/fakenewstest.csv")

# convert pandas df (only columns 'titletext' and 'label') to nlp Dataset
ds_train = Dataset.from_pandas(df_train[['titletext','label']])
ds_test = Dataset.from_pandas(df_test[['titletext','label']])

# set up configuration, tokenizer and model
config = AutoConfig.from_pretrained('bert-base-uncased')    
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_config(config)

# function to tokenize a line of text using tokenizer
def tokenize(batch):
    return tokenizer(batch['titletext'], 
                     max_length = 16, 
                     truncation = True,
                     padding = 'max_length')

# loop through Dataset using Dataset map function for tokenization
train_dataset = ds_train.map(tokenize)
test_dataset = ds_test.map(tokenize)

# Set format of Dataset, and specify columns to use 
# (columns are "input_ids", "attention mask", "token_type_ids" and "label")
train_dataset.set_format('torch', columns=['attention_mask','input_ids', 'label', 'token_type_ids'])
test_dataset.set_format('torch', columns=['attention_mask','input_ids', 'label', 'token_type_ids'])

training_args = TrainingArguments(

def compute_metrics(pred: EvalPrediction) -> dict:
    # true labels and the predicted class (index of the largest logit)
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

trainer = Trainer(model = model, 
                  args = training_args, 
                  train_dataset = train_dataset, 
                  eval_dataset = test_dataset,
                  compute_metrics = compute_metrics)

Then train:

trainer.train()

The output here is TrainOutput(global_step=3150, training_loss=0.7043003456743937)

I also get this warning after some of the iterations:

/opt/anaconda3/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Then evaluate:

trainer.evaluate()

Output here is:
{'eval_loss': 0.6933324187994003, 'eval_accuracy': 0.49682539682539684, 'eval_f1': 0.6638388123011665, 'eval_precision': 0.49682539682539684, 'eval_recall': 1.0, 'epoch': 10.0}
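These numbers look consistent with a model that predicts class 1 for every test row: recall is 1.0 and precision equals accuracy. Here is a minimal sketch that reproduces them with plain Python. The count of 626 positive labels is my assumption, inferred from the reported accuracy and the 1260 test rows (626 / 1260), not read from the data.

```python
# Assumed label counts: 626 positives out of 1260 test rows.
# This is inferred from the reported accuracy, not taken from the CSV.
n_total = 1260
n_pos = 626

labels = [1] * n_pos + [0] * (n_total - n_pos)
preds = [1] * n_total  # constant prediction: always class 1

# Confusion-matrix counts for the positive class.
tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / n_total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print({'accuracy': accuracy, 'f1': f1, 'precision': precision, 'recall': recall})
```

If the assumption about the label counts holds, the printed values match eval_accuracy, eval_f1, eval_precision and eval_recall above, which suggests the metrics carry no information beyond the class balance.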

Then to see what the actual predictions are:

y_pred = trainer.predict(test_dataset)

and y_pred gives this
PredictionOutput(predictions=array([[-0.14629209, -0.12169936], [-0.14621323, -0.12170692], [-0.14623126, -0.12170577], ..., [-0.14624032, -0.12171113], [-0.14624621, -0.12170743], [-0.14624593, -0.12170401]], dtype=float32), label_ids=array([1, 1, 0, ..., 0, 1, 0]), metrics={'eval_loss': 0.6933324187994003, 'eval_accuracy': 0.49682539682539684, 'eval_f1': 0.6638388123011665, 'eval_precision': 0.49682539682539684, 'eval_recall': 1.0})
which shows that the logits are nearly identical for every example.

I then found the probabilities:

df_preds.apply(lambda row : np.exp(row[0])/(np.exp(row[0])+np.exp(row[1])),axis=1)

The output below shows that they are all nearly the same.

0       0.493852
1       0.493874
2       0.493869
3       0.493869
4       0.493860
          ...
1255    0.493880
1256    0.493867
1257    0.493868
1258    0.493866
1259    0.493865
Length: 1260, dtype: float64
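For reference, the probability calculation above is a two-class softmax over the logits. Below is a minimal sketch that applies it to the first three logit rows copied from the PredictionOutput; the helper name `softmax` is my own and is not part of the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First three logit rows copied from the PredictionOutput above.
logits = [
    [-0.14629209, -0.12169936],
    [-0.14621323, -0.12170692],
    [-0.14623126, -0.12170577],
]

probs = [softmax(row) for row in logits]

# Predicted class is the index of the larger probability (same as argmax of the logits).
predicted = [row.index(max(row)) for row in probs]

for p, c in zip(probs, predicted):
    print(f"P(class 0) = {p[0]:.6f}, predicted class = {c}")
```

The class 0 probabilities agree with the first three rows of the pandas output (0.493852, 0.493874, 0.493869), and since each is just under 0.5, every row is assigned class 1.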

Output from first example:

[[0.001976393861696124, 0.9980236291885376],
 [0.019398385658860207, 0.9806016683578491],
 [0.9896875023841858, 0.010312470607459545],
 [0.5828854441642761, 0.4171145558357239],
 [0.0018801470287144184, 0.9981197714805603],
 [0.9941951632499695, 0.00580484326928854],
 [0.9829741716384888, 0.0170258991420269],
 [0.994141161441803, 0.005858846940100193],

