I am unsure how to proceed with creating a Dataset that has multiple labels, where each label has its own, different set of classes.
A multi-label example is shared here, but the classes are always either 0 or 1. My task is slightly different. To illustrate what I mean, take this dataset:
| text (X) | region (y₁) | weather (y₂) | sentiment (y₃) |
|---|---|---|---|
| ‘The Taj Mahal was beautiful, even in monsoon season’ | ‘asia’ | ‘rain’ | ‘positive’ |
| ‘Surprisingly warm in London with blue skies…’ | ‘europe’ | ‘sun’ | ‘positive’ |
| ‘Snowstorms and icey roads in Illinios, will never go there again!!!’ | ‘america’ | ‘snow’ | ‘negative’ |
| ‘I sincerely enjoyed my Safari despite the hot temperatures’ | ‘africa’ | ‘sun’ | ‘positive’ |
- text: The input used to predict the three outputs.
- region: First output to predict, four possible classes: 'asia', 'europe', 'america', 'africa'
- weather: Second output to predict, three possible classes: 'rain', 'sun', 'snow'
- sentiment: Third output to predict, two possible classes: 'positive', 'negative'
The questions stemming from such a multi-label/multi-class task:
- What are the right types of Features? For `text`, it should be a `Value(dtype='string', id=None)` feature, which will be tokenized.
- The outputs are supposed to be saved in a `labels` feature, illustrated in tutorials with `ClassLabel`. So for the `region` label it should be something like this: `ClassLabel(num_classes=4, names=['asia', 'europe', 'america', 'africa'], id=None)`. But how does it work with three outputs and different `names`?
I tried looking into this where the labels feature would be of type Sequence (see this post).
However, as @lhoestq pointed out, they should have their own columns. Otherwise, names ends up having all classes in one place, e.g., Sequence(ClassLabel(names=['asia', 'europe', 'america', 'africa', 'rain', 'sun', 'snow', 'positive', 'negative'])) which would be wrong (?).
So now I am unsure how to build my dataset so that all labels and classes are predicted correctly at once.
More questions arising from this once we start training the model:
- Given the above dataset, how would the `id2label` and `label2id` dicts be structured?
- What would be the correct `num_labels` passed on to the model? Would it be `3` due to the three columns to predict? Or would it be `9` (4+3+2) due to the different outcomes?
- The `problem_type` should be `multi_label_classification`, right? (asking because in previous trainings I ran into issues with tensor sizes not matching and some issues with Float and Long)
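To make the two readings of `num_labels` concrete, here is a minimal plain-Python sketch of both encodings for one row. It is only an illustration of the arithmetic (3 columns versus 4+3+2 = 9 outcomes), not a statement of which one the model expects; the names `TASKS`, `encode_per_task`, and `encode_multi_hot`, and the `task:name` prefix scheme, are hypothetical.

```python
# Class names per output column, taken from the example table above.
TASKS = {
    "region": ["asia", "europe", "america", "africa"],
    "weather": ["rain", "sun", "snow"],
    "sentiment": ["positive", "negative"],
}

# Reading 1: one label2id / id2label pair per column (ids restart at 0 per task).
label2id = {
    task: {name: i for i, name in enumerate(names)}
    for task, names in TASKS.items()
}
id2label = {
    task: {i: name for name, i in mapping.items()}
    for task, mapping in label2id.items()
}

# Reading 2: one flat label space with 4 + 3 + 2 = 9 entries, prefixed by task.
flat_names = [f"{task}:{name}" for task, names in TASKS.items() for name in names]
flat_label2id = {name: i for i, name in enumerate(flat_names)}


def encode_per_task(row):
    """Return one integer class id per output column."""
    return {task: label2id[task][row[task]] for task in TASKS}


def encode_multi_hot(row):
    """Return a 9-dim float vector with exactly one 1.0 per task block."""
    vec = [0.0] * len(flat_names)
    for task in TASKS:
        vec[flat_label2id[f"{task}:{row[task]}"]] = 1.0
    return vec


row = {"region": "europe", "weather": "sun", "sentiment": "positive"}
per_task_ids = encode_per_task(row)  # three integer ids, one per column
multi_hot = encode_multi_hot(row)    # nine floats, three of them set to 1.0
```

The multi-hot vector is built from floats on purpose: as far as I know, `problem_type="multi_label_classification"` in transformers uses a binary cross-entropy loss that expects float targets, whereas integer (Long) class ids go with the single-label cross-entropy loss, which may be related to the Float/Long errors mentioned above.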
Thanks!