Given a df with the following columns:
| text | labels |
| --- | --- |
| This is a sentence | [0, 5, 3] |
| I am unhappy | [1, 9, 10] |
OR
| text | label1 | label2 | label3 |
| --- | --- | --- | --- |
| This is a sentence | 0 | 5 | 3 |
| I am unhappy | 1 | 9 | 10 |
How would one create a Dataset with the following structure of dataset.features:
features = Features({
    'text': Value(dtype='string', id=None),
    'labels': Sequence({
        'label1': ClassLabel(num_classes=some_number, names=some_names),
        'label2': ClassLabel(num_classes=some_number, names=some_names),
        'label3': ClassLabel(num_classes=some_number, names=some_names)
    })
})
Using from_pandas() with the first df results in a Sequence, but of type int64. With the second df, it results in three features of type int64, which can each be encoded as a ClassLabel using class_encode_column(). However, one would then end up with multiple label features, whereas, as I understood from this tutorial, they are required to be in one feature called labels.
Therefore, how can I create a Dataset with a Sequence feature that consists of ClassLabels and can be used for a classifier in multi-class and multi-label scenarios?
I saw this post, but the implementation is unclear to me.
You can use the first format and then cast the column to a Sequence of ClassLabel:
ds = Dataset.from_pandas(df)
ds = ds.cast_column("labels", Sequence(ClassLabel(names=...)))
Amazing, thank you!
If we assume this to be the complete dataframe (same as above):
| text | labels |
| --- | --- |
| This is a sentence | [0, 5, 3] |
| I am unhappy | [1, 9, 10] |
The names= to be added would be [0, 5, 3, 1, 9, 10]:
ds = Dataset.from_pandas(df)
ds = ds.cast_column("labels", Sequence(ClassLabel(names=[0, 5, 3, 1, 9, 10])))
However, for label1 (the first value in each list), only the labels 0 and 1 would be valid (for label2 it’d be 5 and 9, and so on).
Will this lead to any problems further on? E.g., could it mistakenly predict a 5 for label1?
In this case it seems you have multiple classifications to make. Ideally, each one should have its own column (holding a single integer if there is only one possible label, or a list if there are multiple possible labels per column).
Hi @lhoestq,
I have a dataset formatted for multi-class classification with 4 classes. My objective is to convert the labels to one-hot encodings and create a Hugging Face dataset. To achieve this, I utilized the sklearn OneHotEncoder function for encoding. However, I encountered an error when attempting to convert the dataset to the Hugging Face format using the following line of code:
features = Features({"text": Value("string"), "label": Sequence(ClassLabel(names=['ang', 'hap', 'neu', 'sad']))})
The error message I received is as follows:
TypeError: Couldn't cast array of type
string
to
Sequence(feature=ClassLabel(names=['ang', 'hap', 'neu', 'sad'], id=None), length=-1, id=None)
Could you please provide guidance on how to resolve this issue?
Thank you!
Hi! ClassLabel is for columns of integers. To transform your strings to integers, you can use .map() (or the .class_encode_column() method, which infers the names automatically).