Hello!
I'm trying to learn and get more familiar with AI and ML, so I've been tinkering. I want a classifier that sorts short book descriptions (around 70 words each) into comedy and non-comedy.
Since the data is all unlabeled, zero-shot classification seems like the most suitable way to “understand” the text and classify the books. I am using “facebook/bart-large-mnli” because, after some research, it seems to be the right tool for this.
However, after running my code, the results are pretty poor. The system classifies everything as comedy with scores of 0.8 to 0.95, even books I know for a fact are more serious or even tragic.
I am a bit at a loss and have no clue how to improve the performance. Here is my code:
import pandas as pd
from transformers import pipeline

# Zero-shot classifier built on an NLI model
pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def load_csv_file(filename):
    return pd.read_csv(filename)


if __name__ == "__main__":
    comedy_score_zero_shot = []
    df = load_csv_file("books.csv")
    print(df)

    # Candidate labels are the same for every book
    labels = ["comedic", "serious"]

    for index, row in df.iterrows():
        prompt = f'Given the synopsis of the following work: {row["description"]}. Is it a comedy or a serious work?'
        # print(prompt)
        result = pipe(prompt, labels, hypothesis_template="This play is {}.")
        # Single quotes inside the f-string so this also parses before Python 3.12
        print(f"{index} {row['title']}: {result['scores']}")
        # result["scores"] is sorted from highest to lowest, so [0] is the top label's score
        comedy_score_zero_shot.append(result["scores"][0])

    df_scores = pd.DataFrame({"comedy_score_Zero_shot": comedy_score_zero_shot})
    df_scores.to_csv("comedy_scores.csv", index=False)
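One thing that may explain the uniformly high numbers: the zero-shot pipeline returns `labels` and `scores` sorted from most to least likely, so `result["scores"][0]` is the score of whichever label won, not necessarily "comedic". Below is a minimal sketch of reading the score by label name instead. It also passes the bare description as the text and lets the hypothesis template carry the question, which is an untested suggestion rather than a confirmed fix. The helper names `score_for_label` and `comedy_score`, the "This book is {}." template, and the stand-in `fake_pipe` with its made-up numbers are all assumptions for illustration; in real use you would pass the actual `pipe` object.

```python
LABELS = ["comedic", "serious"]


def score_for_label(result, label):
    """Return the score assigned to `label`, regardless of its rank in the output."""
    return dict(zip(result["labels"], result["scores"]))[label]


def comedy_score(pipe, description):
    """Classify one description and return the score for the "comedic" label."""
    # The description alone is the premise; the template builds the hypothesis.
    result = pipe(description, LABELS, hypothesis_template="This book is {}.")
    return score_for_label(result, "comedic")


def fake_pipe(sequence, labels, hypothesis_template):
    """Hypothetical stand-in for the real pipeline, with made-up scores."""
    return {"sequence": sequence, "labels": ["serious", "comedic"], "scores": [0.9, 0.1]}


example = fake_pipe("A widow returns to her childhood home.", LABELS, "This book is {}.")
print("top score:", example["scores"][0])
print("comedic score:", comedy_score(fake_pipe, "A widow returns to her childhood home."))
```

With the real pipeline, `comedy_score(pipe, row["description"])` would replace the `result["scores"][0]` lookup inside the loop.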
Is something wrong? Am I missing something about zero-shot classification? Is the prompt too long? Is this not the right model for this case? Any suggestions or criticisms are welcome!
Thanks!