Given a df with the following columns:

| text | labels |
|---|---|
| This is a sentence | [0, 5, 3] |
| I am unhappy | [1, 9, 10] |
OR

| text | label1 | label2 | label3 |
|---|---|---|---|
| This is a sentence | 0 | 5 | 3 |
| I am unhappy | 1 | 9 | 10 |
How would one create a Dataset with the following structure of dataset.features:
features = Features({
    'text': Value(dtype='string', id=None),
    'labels': Sequence({
        'label1': ClassLabel(num_classes=some_number, names=some_names),
        'label2': ClassLabel(num_classes=some_number, names=some_names),
        'label3': ClassLabel(num_classes=some_number, names=some_names),
    }),
})
Using from_pandas() with the first df results in a Sequence, but of type int64. With the second df it results in three features of type int64, which can each be encoded as a ClassLabel using class_encode_column(). But then one would end up with multiple label features, whereas they are required to be in one feature called labels, as I understood from this tutorial.
Therefore, how does one create a Dataset with a Sequence feature that consists of ClassLabels and that can be used for a classifier in multi-class and multi-label scenarios?
I saw this post, but the implementation is unclear to me.
You can use the first format and then cast the column to a Sequence of ClassLabel :)
from datasets import ClassLabel, Dataset, Sequence
ds = Dataset.from_pandas(df)
ds = ds.cast_column("labels", Sequence(ClassLabel(names=...)))
Amazing, thank you!
If we assume this to be the complete dataframe (same as above):

| text | labels |
|---|---|
| This is a sentence | [0, 5, 3] |
| I am unhappy | [1, 9, 10] |

The names= to be added would be [0, 5, 3, 1, 9, 10]:
ds = Dataset.from_pandas(df)
ds = ds.cast_column("labels", Sequence(ClassLabel(names=[0, 5, 3, 1, 9, 10])))
However, for label1 (the first value of each list) only the labels 0 and 1 would be valid (for label2 it'd be 5 and 9, and so on).
Will this lead to any problems further on? E.g., could it mistakenly predict a 5 for label1?
In this case it seems you have multiple classifications to make. Ideally each one should have its own column (holding a single integer if there is only one possible label, or a list if several labels can apply at once).
Hi @lhoestq,
I have a dataset formatted for multi-class classification with 4 classes. My objective is to convert the labels to one-hot encodings and create a Hugging Face dataset. To achieve this, I used the sklearn OneHotEncoder for encoding. However, I encountered an error when attempting to convert the dataset to the Hugging Face format using the following line of code:
features = Features({"text": Value("string"), "label": Sequence(ClassLabel(names=['ang', 'hap', 'neu', 'sad']))})
The error message I received is as follows:
TypeError: Couldn't cast array of type string to Sequence(feature=ClassLabel(names=['ang', 'hap', 'neu', 'sad'], id=None), length=-1, id=None)
Could you please provide guidance on how to resolve this issue?
Thank you!
Hi! ClassLabel is for columns of integers. To transform your strings to integers you can use .map() (or the .class_encode_column() method, which infers the names automatically).