Hello all,
I am loading a dataset that has NaN values in a string column. The feature I am using for this column is ClassLabel, and I want each NaN to be encoded as -1 instead of None, which is what happens currently.
The codebase here mentions that one can use negative integers to represent unknown/missing labels.
How do I enable/use this feature?
In the example below, the test_cat column should contain -1 instead of None, which I presume is what would happen once the feature is enabled.
Code example:
import datasets
from datasets import ClassLabel, Features, Sequence, Value, load_dataset

# OUTPUT_PATH is assumed to be a pathlib.Path defined elsewhere (not shown here).
print(f"{datasets.__version__ = }")
req_cols = ["test_bool", "test_float", "test_int", "test_cat", "test_multi_str"]
# Three known classes; the goal is for missing values to come back as -1.
cls_type = ClassLabel(num_classes=3, names=["a", "b", "c"])
features = Features(
    {
        "test_bool": Value("bool"),
        "test_cat": cls_type,
        "test_float": Value("float32"),
        "test_int": Value("int64"),
        "test_multi_str": Sequence(feature=cls_type),
    }
)
ds2 = load_dataset(
    "parquet",
    data_files=[(OUTPUT_PATH / "test3.parquet").as_posix()],
    columns=req_cols,
    features=features,
)
ds2["train"][10:20]
Output:
datasets.__version__ = '2.18.0'
{'test_bool': [True, None, True, True, True, True, None, None, None, None],
 'test_cat': [1, 1, 2, None, 1, 2, None, 0, None, 2],
 'test_float': [None,
  3.140000104904175,
  0.0,
  3.140000104904175,
  3.140000104904175,
  None,
  0.0,
  2.7179999351501465,
  3.140000104904175,
  None],
 'test_int': [4, 2, 1, 2, 3, 5, 2, 3, 4, 3],
 'test_multi_str': [[0],
  None,
  [0, 1, 2],
  None,
  [0, 2],
  [1],
  None,
  [0],
  [0],
  [1]]}
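For reference, the result I am after can be sketched in plain Python. This is only an illustration of the desired transformation, not a built-in option of the library: fill_missing_labels is a hypothetical helper name, and the sample batch is copied from the test_cat output above.

```python
# Hypothetical helper (not part of the datasets library): replace missing
# labels (None) in one column of a batch with a sentinel value.
def fill_missing_labels(batch, column="test_cat", missing=-1):
    return {column: [missing if v is None else v for v in batch[column]]}


# Stand-in batch, copied from the test_cat values in the output above.
batch = {"test_cat": [1, 1, 2, None, 1, 2, None, 0, None, 2]}
filled = fill_missing_labels(batch)
print(filled["test_cat"])  # [1, 1, 2, -1, 1, 2, -1, 0, -1, 2]
```

I assume a function of this shape could be passed to ds2.map(fill_missing_labels, batched=True) as a post-load workaround, but I have not confirmed that ClassLabel preserves -1 through map in 2.18.0, and I would prefer a way to get this directly at load time.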
Thanks.