Issue with Code?

Small sidenote: VS Code doesn’t flag any errors, so it should run completely fine. What I’m worried about (and it has happened before) is that previous iterations of the code ran all the way through but produced blank or inaccurate results.

Where is the part about the month?

# ... [Previous Code]

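# Monthly Aggregation
# Derive the month from pubdate first; nothing else in the script creates a "month" column.
df_2021["month"] = df_2021["pubdate"].dt.month
monthly_top_categories = {}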
for month_number in sorted(df_2021["month"].unique()):
    subset_month = df_2021[df_2021["month"] == month_number]
    if len(subset_month) == 0:
        continue
    top_month_cat = subset_month["predicted_category"].value_counts().idxmax()
    monthly_top_categories[month_number] = top_month_cat

print("\n=== TOP CATEGORIES BY MONTH (2021) ===")
for month_num in range(1, 13):
    if month_num in monthly_top_categories:
        print(f"Month {month_num}: {monthly_top_categories[month_num]}")
    else:
        print(f"Month {month_num}: No data")

# ... [Rest of the Code]
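
The same monthly aggregation can also be written with `groupby`. This is only a minimal sketch, assuming `df_2021` has a datetime `pubdate` column and a `predicted_category` column as in the posted script. The small DataFrame below is made-up sample data, not real output.

```python
import pandas as pd

# Made-up sample rows standing in for df_2021 (hypothetical data).
df_2021 = pd.DataFrame({
    "pubdate": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-01-25", "2021-03-02"]),
    "predicted_category": ["Health", "Health", "Sports", "Politics"],
})

# The month is derived from pubdate; nothing else creates a "month" column.
df_2021["month"] = df_2021["pubdate"].dt.month

# Most frequent category per month, collected into a plain dict.
monthly_top_categories = (
    df_2021.groupby("month")["predicted_category"]
    .agg(lambda s: s.value_counts().idxmax())
    .to_dict()
)

# Months without any rows are reported as "No data".
for month_num in range(1, 13):
    print(f"Month {month_num}: {monthly_top_categories.get(month_num, 'No data')}")
```

Months that have no rows simply never appear in the dict, so the `.get(..., 'No data')` fallback replaces the explicit `if`/`else`.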

So, this code:

import pandas as pd
import numpy as np
from transformers import pipeline
import torch
from scipy.stats import entropy
from datetime import datetime

def batch_zero_shot_classification(
   texts,
   zero_shot_pipeline,
   candidate_labels,
   hypothesis_template="This text is about {}.",
   batch_size=16
):
   all_results = []
   n = len(texts)
   for start_i in range(0, n, batch_size):
       end_i = min(start_i + batch_size, n)
       batch_texts = texts[start_i:end_i]

       batch_outputs = zero_shot_pipeline(
           batch_texts,
           candidate_labels=candidate_labels,
           hypothesis_template=hypothesis_template
       )

       if isinstance(batch_outputs, dict):
           batch_outputs = [batch_outputs]

       all_results.extend(batch_outputs)

   return all_results

def soft_label_filter(results, min_confidence=0.5, entropy_threshold=1.2):
   clean = []
   for result in results:
       scores = result["scores"]
       labels = result["labels"]
       top_label = labels[0]
       top_score = scores[0]
       label_entropy = entropy(scores)
       if top_score >= min_confidence and label_entropy <= entropy_threshold:
           clean.append(top_label)
       else:
           clean.append("uncertain")
   return clean

def main(excel_file_path, sheet_name=0):
   device_id = 0 if torch.cuda.is_available() else -1
   batch_size_ca = 16
   batch_size_topics = 16

   print("Reading Excel file...")
   df = pd.read_excel(excel_file_path, sheet_name=sheet_name, dtype=str)
   required_columns = ["Title", "subjectTerms", "classification", "identifierKeywords", "pubdate"]
   for col in required_columns:
       if col not in df.columns:
           raise ValueError(f"Required column '{col}' is missing from the Excel file.")

   df["pubdate"] = pd.to_datetime(df["pubdate"], errors="coerce")
   df.dropna(subset=["pubdate"], inplace=True)
   df[required_columns] = df[required_columns].fillna("")
   df["combined_text"] = (
       df["Title"] + " " +
       df["subjectTerms"] + " " +
       df["classification"] + " " +
       df["identifierKeywords"]
   ).str.strip()

   candidate_labels_for_california = ["California", "Not California"]
   print("Loading zero-shot pipeline for California detection...")
   zero_shot_california = pipeline(
       "zero-shot-classification",
       model="facebook/bart-large-mnli",
       device=device_id
   )

   print("Classifying for California relevance...")
   texts_ca = df["combined_text"].tolist()
   ca_results = batch_zero_shot_classification(
       texts=texts_ca,
       zero_shot_pipeline=zero_shot_california,
       candidate_labels=candidate_labels_for_california,
       batch_size=batch_size_ca
   )

   filtered_labels = soft_label_filter(ca_results)
   df["is_california"] = [lbl == "California" for lbl in filtered_labels]

   removed_df = df[~df["is_california"]].copy()
   kept_df = df[df["is_california"]].copy()
   print("Removed:", len(removed_df))
   removed_df.to_excel("removed_articles.xlsx", index=False)

   df = kept_df.reset_index(drop=True)

   print("Loading zero-shot pipeline for topic detection...")
   zero_shot_topic = pipeline(
       "zero-shot-classification",
       model="facebook/bart-large-mnli",
       device=device_id
   )

   df_texts = df["combined_text"].tolist()
   topic_results = batch_zero_shot_classification(
       texts=df_texts,
       zero_shot_pipeline=zero_shot_topic,
       candidate_labels=["Technology", "Health", "Politics", "Sports", "Entertainment"],
       batch_size=batch_size_topics
   )

   predicted_categories = soft_label_filter(topic_results)
   df["predicted_category"] = predicted_categories

   df_2021 = df[df["pubdate"].dt.year == 2021].copy()
   df_2021["iso_year"] = df_2021["pubdate"].dt.isocalendar().year
   df_2021["iso_week"] = df_2021["pubdate"].dt.isocalendar().week
   df_2021["iso_day"] = df_2021["pubdate"].dt.isocalendar().day
   df_2021 = df_2021[df_2021["iso_year"] == 2021]

   weekly_top_categories = {}
   for week_number in sorted(df_2021["iso_week"].unique()):
       subset_week = df_2021[df_2021["iso_week"] == week_number]
       if len(subset_week) == 0:
           continue
       top_cat = subset_week["predicted_category"].value_counts().idxmax()
       weekly_top_categories[week_number] = top_cat

   print("\n=== TOP CATEGORIES BY ISO WEEK (2021) ===")
   for week_num in range(1, 54):
       if week_num in weekly_top_categories:
           print(f"ISO Week {week_num}: {weekly_top_categories[week_num]}")
       else:
           print(f"ISO Week {week_num}: No data")

   df_2021.to_excel("kept_and_categorized_articles_2021.xlsx", index=False)
   print("Done.")
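   # Monthly aggregation, indented so it runs inside main() where df_2021 exists.
   # (At module level this block raised a NameError before main() was ever called.)
   # The "month" column is derived from pubdate because nothing above creates it.
   df_2021["month"] = df_2021["pubdate"].dt.month
   monthly_top_categories = {}
   for month_number in sorted(df_2021["month"].unique()):
      subset_month = df_2021[df_2021["month"] == month_number]
      if len(subset_month) == 0:
         continue
      top_month_cat = subset_month["predicted_category"].value_counts().idxmax()
      monthly_top_categories[month_number] = top_month_cat

   print("\n=== TOP CATEGORIES BY MONTH (2021) ===")
   for month_num in range(1, 13):
      if month_num in monthly_top_categories:
         print(f"Month {month_num}: {monthly_top_categories[month_num]}")
      else:
         print(f"Month {month_num}: No data")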

if __name__ == "__main__":
   excel_file_path = r"C:\Users\Jamja\Downloads\DocumentschatGPT.xlsx"
   main(excel_file_path)

def external_gamma_per_row(df, columns, base_column="col_4"):
   if len(columns) == 0:
       raise ValueError("The 'columns' list cannot be empty.")
   gamma = df[base_column] / len(columns)
   gamma = gamma.replace(0, 1e-6)

   adjusted_columns = []
   for col in columns:
       adjusted = df[col] / gamma
       adjusted_columns.append(adjusted)

   result = np.prod(adjusted_columns, axis=0)
   return result 

Would this work without issue? The last time I ran a previous iteration, it produced blank answers:
Week 1: (And then empty for all the other weeks).
This, and the fact that it takes many hours, can be extremely frustrating, so any advice or suggestions beforehand would be greatly appreciated!

Week 1: (And then empty for all the other weeks).

If this code produces that result, then the format of the Excel file is probably different from what the program expects.
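
One way to check that before a multi-hour run is to count where rows would be lost. The sketch below is only an assumption about what is worth checking: the helper name `summarize_input` is hypothetical, the column names are taken from the posted script, and the sample rows are made up.

```python
import pandas as pd

# Column names copied from the posted script.
REQUIRED_COLUMNS = ["Title", "subjectTerms", "classification", "identifierKeywords", "pubdate"]

def summarize_input(df):
    """Return quick counts showing where rows would drop out before classification."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return {"missing_columns": missing}
    # Same parsing rule as the script: unparseable dates become NaT and are dropped later.
    parsed = pd.to_datetime(df["pubdate"], errors="coerce")
    return {
        "missing_columns": [],
        "rows": len(df),
        "unparseable_pubdate": int(parsed.isna().sum()),
        "rows_in_2021": int((parsed.dt.year == 2021).sum()),
        "empty_title": int((df["Title"].fillna("").str.strip() == "").sum()),
    }

# Made-up rows (hypothetical data) standing in for the real spreadsheet.
sample = pd.DataFrame({
    "Title": ["Wildfire season begins", "", "Election results"],
    "subjectTerms": ["fires", "", "politics"],
    "classification": ["", "", ""],
    "identifierKeywords": ["", "", ""],
    "pubdate": ["2021-06-01", "not a date", "2020-11-04"],
})
report = summarize_input(sample)
print(report)
```

With the real file, the input would come from `pd.read_excel(excel_file_path, dtype=str)` as in the script. If `rows_in_2021` is zero or `unparseable_pubdate` is close to `rows`, the weekly and monthly loops have nothing to count, which would explain empty output.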

When you have a tool like ChatGPT write something, any conditions you didn’t explicitly give it are filled in from its own assumptions, so what you get is closer to a draft or concept than a finished product. You will need to do the fine-tuning yourself by researching and thinking it through, or give ChatGPT sample data separately and ask it about that.

My new code is not created by ChatGPT

Would there be any other workarounds? Using Hugging Face seems to be very hostile and unforgiving for this particular endeavor, but I know of no alternative that would process 18,000+ different article titles.

Using Hugging Face seems to be very hostile and unforgiving for this particular endeavor

?

any other workarounds?

Since zero-shot classification is a method where you specify the candidate labels on every call, why not use ordinary text classification instead?
With ordinary text classification, the labels are fixed per model (unless you fine-tune it yourself), so classification is limited to that label set, but within that set it is fully automatic.
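
As a rough sketch of what that could look like: the helper names `classify_titles` and `load_text_classifier` are hypothetical, and no particular model is assumed. `model_name` would have to be a text-classification checkpoint whose fixed labels match the topics you need. A stub classifier stands in for the real pipeline here so the batching logic can be checked in seconds without downloading anything.

```python
from collections import Counter

def classify_titles(titles, classifier, batch_size=32):
    """Run a fixed-label text classifier over titles in batches; return the top labels."""
    labels = []
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        # Expected shape: one {"label": ..., "score": ...} dict per input text.
        outputs = classifier(batch)
        labels.extend(out["label"] for out in outputs)
    return labels

def load_text_classifier(model_name):
    """Build a Hugging Face text-classification pipeline for a model of your choice."""
    # Imported here so the batching helper above can be tested without transformers.
    from transformers import pipeline
    return pipeline("text-classification", model=model_name)

def stub_classifier(texts):
    """Hypothetical stand-in for a real pipeline, used only to test the plumbing."""
    return [
        {"label": "Sports" if "game" in t.lower() else "Other", "score": 1.0}
        for t in texts
    ]

titles = ["Big game tonight", "Budget vote delayed", "Game recap"]
labels = classify_titles(titles, stub_classifier, batch_size=2)
print(Counter(labels))
```

Swapping `stub_classifier` for `load_text_classifier(model_name)` keeps the rest of the code unchanged, and no candidate labels need to be passed because they come from the model itself.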