Multi-label classification: getting Sequence(ClassLabel()) for labels

I’m attempting to convert an image fine-tuning notebook to multi-label classification (there are a few more questions coming!). I haven’t touched Python since 2.4, so I’m rusty! The first place I’m stuck is with my labels.

My source dataframe can contain either the indices of the matched labels (e.g. [3, 5]) or a list of zeros and ones, one per category (e.g. [0, 0, 0, 1, 0, 1]). Older posts on this forum say I have to use one or the other. Either way, my understanding is that for Hugging Face to work, I need to convert them to Sequence(ClassLabel(names=classnames), ClassLabel(names=classnames), ClassLabel(names=classnames), ...)
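For reference, the two formats carry the same information, so one can be derived from the other. A minimal sketch in plain Python, where the helper names `indices_to_multi_hot` and `multi_hot_to_indices` are made up for illustration and the number of classes (6) is taken from the example above:

```python
def indices_to_multi_hot(indices, num_classes):
    """Turn a list of class indices into a 0/1 vector of length num_classes."""
    vec = [0] * num_classes
    for i in indices:
        vec[i] = 1
    return vec


def multi_hot_to_indices(vec):
    """Turn a 0/1 vector back into the list of indices that are set."""
    return [i for i, flag in enumerate(vec) if flag]


print(indices_to_multi_hot([3, 5], 6))           # [0, 0, 0, 1, 0, 1]
print(multi_hot_to_indices([0, 0, 0, 1, 0, 1]))  # [3, 5]
```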

First question: how is the ClassLabel value set? For single-label classification, this works great:

ds = ds.cast_column("label", ClassLabel(num_classes=2, names=['accept', 'reject']))

but I don’t understand which positional or named argument takes the column value. I’ve looked at the source code for ClassLabel and I’m still none the wiser 🙂

Second question: how do I massage my labels into the right format to pass in for training? I tried the following, along with several variations, but cannot get it to work. Or am I going in the wrong direction?

df.labels = df['labels'].apply(lambda x, cl=classlist: [ClassLabel(names=cl) for y in list(x.split(','))])