Using a dataset for a classifier with multiple categories

I’m looking to use a Hugging Face movie dataset to classify movies by genre, since the dataset includes genre labels along with plot summaries. However, a chunk of the movies have multiple genres in the genre field, like “war, action”. Can I use an LLM classifier to put each movie in one or more categories, or do I need data that has only one category per movie?


I think it’s probably possible.

That looks cool, but above my current level of understanding to implement. I switched datasets instead.


Maybe try some preprocessing and turn each movie's list of genres into a binary indicator vector (one 0/1 entry per genre), like this:

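from sklearn.preprocessing import MultiLabelBinarizer

genres = [["war", "action"], ["comedy"], ["drama", "romance"]]
mlb = MultiLabelBinarizer()
genre_labels = mlb.fit_transform(genres)  # one row per movie, one 0/1 column per genre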

Then you could use a model that supports multi-label classification, i.e. one that can predict several genres for the same movie. This might be a good approach for you.
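
In case it helps, here is a rough sketch of what the whole thing could look like end to end. It is not the only way to do it: I'm assuming scikit-learn, a TF-IDF bag-of-words on the plot text, and one logistic regression per genre via OneVsRestClassifier instead of an LLM. The plots and the predict_genres helper are made up for illustration, and six examples are far too few to learn anything real.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, invented for illustration only (not from any real dataset).
plots = [
    "soldiers fight a battle in the trenches",
    "a soldier leads a daring battle raid",
    "two friends tell jokes at a wedding",
    "a clumsy waiter tells jokes all night",
    "lovers part in tears at the station",
    "a widow finds love again in tears",
]
genres = [
    ["war", "action"],
    ["war", "action"],
    ["comedy"],
    ["comedy"],
    ["drama", "romance"],
    ["drama", "romance"],
]

# Turn the genre lists into a 0/1 matrix: one row per movie, one column per genre.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)

# Turn the plot summaries into TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(plots)

# Fit one independent binary classifier per genre column.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, y)


def predict_genres(plot):
    """Hypothetical helper: return the predicted genre names for one plot summary."""
    pred = clf.predict(vectorizer.transform([plot]))
    return list(mlb.inverse_transform(pred)[0])


print(mlb.classes_)
print(predict_genres("a battle between soldiers"))
```

Because each genre gets its own yes/no decision, a movie can come back with zero, one, or several genres. With this little data the result may well be an empty list. If you fine-tune a transformer instead, the same 0/1 label matrix is what you would feed it as the targets.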


Thanks, that looks like a good approach. I checked what was in the genre field with select distinct…and it was a mess. Some items had nested hierarchies of movie genres five levels deep, while others had just one or two words. It seemed like a bad dataset for a beginner like me, so I found a clean one with 10 straightforward genres.
