I am unsure how to create a Dataset with multiple labels and classes, where the classes differ from label to label.
A multi-label example is shared here, but the classes are always either `0` or `1`. My task is slightly different. To illustrate what I mean, take this dataset:
text (X) | region (y₁) | weather (y₂) | sentiment (y₃) |
---|---|---|---|
‘The Taj Mahal was beautiful, even in monsoon season’ | ‘asia’ | ‘rain’ | ‘positive’ |
‘Surprisingly warm in London with blue skies…’ | ‘europe’ | ‘sun’ | ‘positive’ |
‘Snowstorms and icey roads in Illinios, will never go there again!!!’ | ‘america’ | ‘snow’ | ‘negative’ |
‘I sincerely enjoyed my Safari despite the hot temperatures’ | ‘africa’ | ‘sun’ | ‘positive’ |
- `text`: The input used to predict the three outputs.
- `region`: First output to predict, four possible classes: `'asia'`, `'europe'`, `'america'`, `'africa'`.
- `weather`: Second output to predict, three possible classes: `'rain'`, `'sun'`, `'snow'`.
- `sentiment`: Third output to predict, two possible classes: `'positive'`, `'negative'`.
The questions stemming from such a multi-label/multi-class task:
- What are the right types of Features? For `text`, it should be the `Value(dtype='string', id=None)` feature type, which will be tokenized.
- The outputs are supposed to be saved in a `labels` feature, illustrated in tutorials with `ClassLabel`. So for the `region` label it should be something like this: `ClassLabel(num_classes=4, names=['asia', 'europe', 'america', 'africa'], id=None)`. But how does it work with three outputs and different `names`?
I tried looking into this, where the `labels` feature would be of type `Sequence` (see this post).
However, as @lhoestq pointed out, they should have their own columns. Otherwise, `names` ends up having all classes in one place, e.g., `Sequence(ClassLabel(names=['asia', 'europe', 'america', 'africa', 'rain', 'sun', 'snow', 'positive', 'negative']))`, which would be wrong (?).
So now I am unsure how to build my dataset so that all labels and classes are predicted correctly at once.
More questions arising from this once we start training the model:
- Given the above dataset, how would the `id2label` and `label2id` dicts be structured?
- What would be the correct `num_labels` passed on to the model? Would it be `3` due to the three columns to predict? Or would it be `9` (4+3+2) due to the different outcomes?
- The `problem_type` should be `multi_label_classification`, right? (Asking because in previous training runs I ran into issues with tensor sizes not matching and some issues with Float and Long.)
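To make the two readings of `num_labels` concrete, here is a plain-Python sketch of both. The first keeps one `id2label`/`label2id` pair per column, which as far as I can tell would need a custom model with one head per column, since the stock sequence-classification models have a single head. The second flattens everything into 9 outputs and builds a multi-hot float vector; the `"column:class"` naming is a hypothetical convention I made up for this illustration.

```python
# Class names per target column, taken from the table in this post.
LABEL_NAMES = {
    "region": ["asia", "europe", "america", "africa"],
    "weather": ["rain", "sun", "snow"],
    "sentiment": ["positive", "negative"],
}

# Reading 1: one id2label/label2id pair per column (one head per column).
id2label = {col: dict(enumerate(names)) for col, names in LABEL_NAMES.items()}
label2id = {
    col: {name: i for i, name in enumerate(names)}
    for col, names in LABEL_NAMES.items()
}
num_labels_per_column = {col: len(names) for col, names in LABEL_NAMES.items()}

# Reading 2: flatten all classes into one space of 4 + 3 + 2 = 9 outputs.
# The "column:class" names are a hypothetical convention for this sketch.
flat_names = [
    f"{col}:{name}" for col, names in LABEL_NAMES.items() for name in names
]
flat_label2id = {name: i for i, name in enumerate(flat_names)}


def to_multi_hot(example):
    """Return a float multi-hot vector with one active entry per column."""
    vector = [0.0] * len(flat_names)
    for col in LABEL_NAMES:
        vector[flat_label2id[f"{col}:{example[col]}"]] = 1.0
    return vector


example = {"region": "asia", "weather": "rain", "sentiment": "positive"}
multi_hot = to_multi_hot(example)
```

My understanding is that `multi_label_classification` uses `BCEWithLogitsLoss`, which expects float labels of shape `(batch_size, num_labels)` like `multi_hot` above, while single-label classification uses cross-entropy with Long class indices. That might explain the Float/Long and tensor-size errors. Note that the flattened reading does not enforce exactly one class per column.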
Thanks!