Issue with Code?

ResearchIndepdent · April 15, 2025, 1:02am

A small side note: VS Code doesn’t flag any errors, so the code should run completely fine. What worries me (and it has happened before) is that previous iterations of the code ran all the way through but produced blank or inaccurate results.

John6666 · April 15, 2025, 1:15am

Where is the part about the month?

# ... [Previous Code]

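# Monthly Aggregation
# Derive the calendar month from pubdate (assumes no "month" column exists yet)
df_2021["month"] = df_2021["pubdate"].dt.month
monthly_top_categories = {}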
for month_number in sorted(df_2021["month"].unique()):
    subset_month = df_2021[df_2021["month"] == month_number]
    if len(subset_month) == 0:
        continue
    top_month_cat = subset_month["predicted_category"].value_counts().idxmax()
    monthly_top_categories[month_number] = top_month_cat

print("\n=== TOP CATEGORIES BY MONTH (2021) ===")
for month_num in range(1, 13):
    if month_num in monthly_top_categories:
        print(f"Month {month_num}: {monthly_top_categories[month_num]}")
    else:
        print(f"Month {month_num}: No data")

# ... [Rest of the Code]

ResearchIndepdent · April 15, 2025, 1:22am

So

import pandas as pd
import numpy as np
from transformers import pipeline
import torch
from scipy.stats import entropy
from datetime import datetime

def batch_zero_shot_classification(
   texts,
   zero_shot_pipeline,
   candidate_labels,
   hypothesis_template="This text is about {}.",
   batch_size=16
):
   all_results = []
   n = len(texts)
   for start_i in range(0, n, batch_size):
       end_i = min(start_i + batch_size, n)
       batch_texts = texts[start_i:end_i]

       batch_outputs = zero_shot_pipeline(
           batch_texts,
           candidate_labels=candidate_labels,
           hypothesis_template=hypothesis_template
       )

       if isinstance(batch_outputs, dict):
           batch_outputs = [batch_outputs]

       all_results.extend(batch_outputs)

   return all_results

def soft_label_filter(results, min_confidence=0.5, entropy_threshold=1.2):
   clean = []
   for result in results:
       scores = result["scores"]
       labels = result["labels"]
       top_label = labels[0]
       top_score = scores[0]
       label_entropy = entropy(scores)
       if top_score >= min_confidence and label_entropy <= entropy_threshold:
           clean.append(top_label)
       else:
           clean.append("uncertain")
   return clean

def main(excel_file_path, sheet_name=0):
   device_id = 0 if torch.cuda.is_available() else -1
   batch_size_ca = 16
   batch_size_topics = 16

   print("Reading Excel file...")
   df = pd.read_excel(excel_file_path, sheet_name=sheet_name, dtype=str)
   required_columns = ["Title", "subjectTerms", "classification", "identifierKeywords", "pubdate"]
   for col in required_columns:
       if col not in df.columns:
           raise ValueError(f"Required column '{col}' is missing from the Excel file.")

   df["pubdate"] = pd.to_datetime(df["pubdate"], errors="coerce")
   df.dropna(subset=["pubdate"], inplace=True)
   df[required_columns] = df[required_columns].fillna("")
   df["combined_text"] = (
       df["Title"] + " " +
       df["subjectTerms"] + " " +
       df["classification"] + " " +
       df["identifierKeywords"]
   ).str.strip()

   candidate_labels_for_california = ["California", "Not California"]
   print("Loading zero-shot pipeline for California detection...")
   zero_shot_california = pipeline(
       "zero-shot-classification",
       model="facebook/bart-large-mnli",
       device=device_id
   )

   print("Classifying for California relevance...")
   texts_ca = df["combined_text"].tolist()
   ca_results = batch_zero_shot_classification(
       texts=texts_ca,
       zero_shot_pipeline=zero_shot_california,
       candidate_labels=candidate_labels_for_california,
       batch_size=batch_size_ca
   )

   filtered_labels = soft_label_filter(ca_results)
   df["is_california"] = [lbl == "California" for lbl in filtered_labels]

   removed_df = df[~df["is_california"]].copy()
   kept_df = df[df["is_california"]].copy()
   print("Removed:", len(removed_df))
   removed_df.to_excel("removed_articles.xlsx", index=False)

   df = kept_df.reset_index(drop=True)

   print("Loading zero-shot pipeline for topic detection...")
   zero_shot_topic = pipeline(
       "zero-shot-classification",
       model="facebook/bart-large-mnli",
       device=device_id
   )

   df_texts = df["combined_text"].tolist()
   topic_results = batch_zero_shot_classification(
       texts=df_texts,
       zero_shot_pipeline=zero_shot_topic,
       candidate_labels=["Technology", "Health", "Politics", "Sports", "Entertainment"],
       batch_size=batch_size_topics
   )

   predicted_categories = soft_label_filter(topic_results)
   df["predicted_category"] = predicted_categories

   df_2021 = df[df["pubdate"].dt.year == 2021].copy()
   df_2021["iso_year"] = df_2021["pubdate"].dt.isocalendar().year
   df_2021["iso_week"] = df_2021["pubdate"].dt.isocalendar().week
   df_2021["iso_day"] = df_2021["pubdate"].dt.isocalendar().day
   df_2021 = df_2021[df_2021["iso_year"] == 2021]

   weekly_top_categories = {}
   for week_number in sorted(df_2021["iso_week"].unique()):
       subset_week = df_2021[df_2021["iso_week"] == week_number]
       if len(subset_week) == 0:
           continue
       top_cat = subset_week["predicted_category"].value_counts().idxmax()
       weekly_top_categories[week_number] = top_cat

   print("\n=== TOP CATEGORIES BY ISO WEEK (2021) ===")
   for week_num in range(1, 54):
       if week_num in weekly_top_categories:
           print(f"ISO Week {week_num}: {weekly_top_categories[week_num]}")
       else:
           print(f"ISO Week {week_num}: No data")

   # Monthly aggregation, kept inside main() because df_2021 only exists there
   df_2021["month"] = df_2021["pubdate"].dt.month
   monthly_top_categories = {}
   for month_number in sorted(df_2021["month"].unique()):
       subset_month = df_2021[df_2021["month"] == month_number]
       if len(subset_month) == 0:
           continue
       top_month_cat = subset_month["predicted_category"].value_counts().idxmax()
       monthly_top_categories[month_number] = top_month_cat

   print("\n=== TOP CATEGORIES BY MONTH (2021) ===")
   for month_num in range(1, 13):
       if month_num in monthly_top_categories:
           print(f"Month {month_num}: {monthly_top_categories[month_num]}")
       else:
           print(f"Month {month_num}: No data")

   df_2021.to_excel("kept_and_categorized_articles_2021.xlsx", index=False)
   print("Done.")


if __name__ == "__main__":
   excel_file_path = r"C:\Users\Jamja\Downloads\DocumentschatGPT.xlsx"
   main(excel_file_path)

def external_gamma_per_row(df, columns, base_column="col_4"):
   if len(columns) == 0:
       raise ValueError("The 'columns' list cannot be empty.")
   gamma = df[base_column] / len(columns)
   gamma = gamma.replace(0, 1e-6)

   adjusted_columns = []
   for col in columns:
       adjusted = df[col] / gamma
       adjusted_columns.append(adjusted)

   result = np.prod(adjusted_columns, axis=0)
   return result

Would this work without issue? The last time I ran a previous iteration, it produced blank answers:
Week 1: (And then empty for all the other weeks).
This, and the fact that it takes many hours, can be extremely frustrating, so any advice or suggestions beforehand would be greatly appreciated!

John6666 · April 15, 2025, 1:56am

Week 1: (And then empty for all the other weeks).

If this code produces that result, then the format of the Excel file is probably different from what the program expects.
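One way to test that guess before committing to a multi-hour run is to look at how the date column actually parses. The snippet below is only a sketch: `summarize_pubdate` is a hypothetical helper, and the three sample values stand in for the real `pubdate` column as read with `dtype=str`. Rows that fail to parse become `NaT` under `errors="coerce"` and are then dropped by the posted `dropna` call, so a large `unparseable` count would leave most weeks empty.

```python
import pandas as pd

def summarize_pubdate(raw_dates: pd.Series) -> dict:
    """Report how many date strings parse and how they spread over time."""
    parsed = pd.to_datetime(raw_dates, errors="coerce")
    valid = parsed.dropna()
    year_counts = valid.dt.year.value_counts().sort_index()
    weeks_2021 = valid[valid.dt.year == 2021].dt.isocalendar().week.unique()
    return {
        "total_rows": len(raw_dates),
        "unparseable": int(parsed.isna().sum()),
        "rows_per_year": {int(y): int(n) for y, n in year_counts.items()},
        "iso_weeks_2021": sorted(int(w) for w in weeks_2021),
    }

# Hypothetical sample values; replace with df["pubdate"] from the real file.
sample = pd.Series(["2021-03-05", "not a date", "2020-12-31"])
summary = summarize_pubdate(sample)
print(summary)
```

If `rows_per_year` shows few or no rows for 2021, or `iso_weeks_2021` contains only one week, the problem is in the input dates rather than in the classification step.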

When you have a tool like ChatGPT write something, it fills in any conditions you didn’t state explicitly with its own assumptions, so what you get is usually a draft or concept rather than a finished product. You will need to do the fine-tuning yourself by researching and thinking it through, or give ChatGPT some sample data separately and ask it about that.
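Separately from the file format, the thresholds in the posted `soft_label_filter` are worth a quick sanity check. `scipy.stats.entropy` uses the natural logarithm by default, so the entropy of scores over k labels cannot exceed ln(k). The sketch below uses plain Python, and `max_entropy` and `shannon_entropy` are hypothetical helpers. It assumes the scores sum to 1, as they do in the pipeline's default single-label mode.

```python
import math

def max_entropy(num_labels: int) -> float:
    """Largest possible natural-log entropy over num_labels labels."""
    return math.log(num_labels)

def shannon_entropy(scores):
    """Natural-log Shannon entropy; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in scores if p > 0)

entropy_threshold = 1.2  # value used in the posted soft_label_filter

for k in (2, 5):
    ceiling = max_entropy(k)
    print(f"{k} labels: max entropy {ceiling:.3f}, "
          f"threshold can trigger: {ceiling > entropy_threshold}")

# Even a maximally uncertain two-label result passes both checks.
coin_flip = [0.5, 0.5]
passes = shannon_entropy(coin_flip) <= entropy_threshold and max(coin_flip) >= 0.5
print("50/50 result kept as top label:", passes)
```

Under those assumptions the filter can never return "uncertain" in the two-label California step, so every article is kept or removed on its top label alone. If ambiguous articles should be set aside there, a stricter `min_confidence` for that step would be one option.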

ResearchIndepdent · April 15, 2025, 1:57am

My new code was not created by ChatGPT.

ResearchIndepdent · April 15, 2025, 2:09am

Would there be any other workarounds? Using Hugging Face seems to be very hostile and unforgiving for this particular endeavor, but I know of no other tool that would process 18,000+ different article titles.

John6666 · April 15, 2025, 2:18am

Using Hugging Face seems to be very hostile and unforgiving for this particular endeavor

?

any other workarounds?

Since zero-shot classification works by specifying the candidate labels on every call, why not use normal text classification instead?
With ordinary text classification, the candidate labels are fixed for each model (unless you fine-tune it yourself), so classification is limited to those labels, but within that range it is fully automatic.
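As a rough illustration of that suggestion, the sketch below tallies labels from a fixed-label classifier. The `text-classification` pipeline in transformers returns one `{"label": ..., "score": ...}` dict per input text. Here a stub with the same output shape stands in for it, so the example runs without downloading a model. `fake_topic_classifier` and `count_labels` are hypothetical names, and a real run would need a model already fine-tuned on suitable topic labels, which this thread does not name.

```python
from collections import Counter

def fake_topic_classifier(texts):
    """Stub with the same output shape as a text-classification pipeline."""
    return [
        {"label": "Sports" if "game" in t.lower() else "Politics", "score": 0.9}
        for t in texts
    ]

def count_labels(texts, classifier, min_score=0.5):
    """Classify texts and count labels, marking low scores as 'uncertain'."""
    counts = Counter()
    for result in classifier(texts):
        label = result["label"] if result["score"] >= min_score else "uncertain"
        counts[label] += 1
    return counts

titles = ["Big game tonight", "Senate passes budget", "Game recap"]
print(count_labels(titles, fake_topic_classifier))

# With transformers installed, a real classifier could be swapped in, e.g.:
#   from transformers import pipeline
#   clf = pipeline("text-classification", model="<a fine-tuned topic model>")
#   count_labels(titles, lambda batch: clf(batch, batch_size=16, truncation=True))
```

Because the labels are baked into the model, each title needs a single forward pass, whereas zero-shot classification runs the model once per candidate label for every title.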
