I have a CSV file with two columns: thousands of sentences (column 1, ‘sentence’), each marked as either ‘type1’ or ‘type2’ (column 2, ‘label’). I need to build a classifier that learns to split incoming sentences into these two categories.
I tried to load the data and pass it to the model:
import pandas as pd
from datasets import Dataset
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
df = pd.read_csv('filename.csv')  # columns: 'sentence', 'label'
ds = Dataset.from_pandas(df)
but it never works if I set the model’s num_labels to anything other than 1: I get dimension errors. How do I specify in the dataset that there are 2 labels? (And, more generally, how do I specify that the label column is categorical, with N possible classes?)
I’m really just trying to build a basic sentence classifier from my own labeled data…
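To make the question concrete, here is a minimal sketch of what "categorical with N classes" could mean in practice: turning the string labels into integer ids in the range 0..N-1 and deriving N from the data. It assumes the 'label' column still holds the raw strings ‘type1’/‘type2’; the helper name build_label_maps and the sample list are hypothetical, and this may not be the cause of the dimension errors.

```python
def build_label_maps(labels):
    """Return (label2id, id2label) for the distinct label strings.

    Names are sorted so the same data always yields the same ids.
    """
    names = sorted(set(labels))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


# Hypothetical stand-in for the 'label' column of the CSV.
raw_labels = ["type1", "type2", "type2", "type1"]

label2id, id2label = build_label_maps(raw_labels)
label_ids = [label2id[name] for name in raw_labels]  # integer class ids
num_labels = len(label2id)                           # N, derived from the data

print(label2id, label_ids, num_labels)
```

With pandas, the same mapping could presumably be applied as df["label"] = df["label"].map(label2id) before calling Dataset.from_pandas(df), and num_labels, id2label and label2id passed to from_pretrained. The datasets library also offers Dataset.class_encode_column("label"), which converts a string column to a ClassLabel feature.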