How to mark unknown values in ClassLabel with negative numbers?

Hello all,

I am loading a dataset that has NaN values in a string column. The feature I am using for this column is ClassLabel, and I want the NaN values to be marked as -1 instead of None, which is what happens currently.

The codebase here mentions that one can use negative integers to represent unknown/missing labels.

How do I enable or use this feature?

In the example below, I presume the test_cat column should contain -1 instead of None once the feature is enabled.

Code example:

import datasets
from datasets import ClassLabel, Features, Sequence, Value, load_dataset

# OUTPUT_PATH is a pathlib.Path defined elsewhere, pointing at the folder with the parquet file
print(f"{datasets.__version__ = }")
req_cols = ["test_bool", "test_float", "test_int", "test_cat", "test_multi_str"]
cls_type = ClassLabel(num_classes=3, names=["a", "b", "c"])
features = Features(
   {
       "test_bool": Value("bool"),
       "test_cat": cls_type,
       "test_float": Value("float32"),
       "test_int": Value("int64"),
       "test_multi_str": Sequence(feature=cls_type),
   }
)
ds2 = load_dataset("parquet", data_files=[(OUTPUT_PATH / "test3.parquet").as_posix()], columns=req_cols, features=features)

ds2["train"][10:20]

Output:

datasets.__version__ = '2.18.0'
{'test_bool': [True, None, True, True, True, True, None, None, None, None],
 'test_cat': [1, 1, 2, None, 1, 2, None, 0, None, 2],
 'test_float': [None,
  3.140000104904175,
  0.0,
  3.140000104904175,
  3.140000104904175,
  None,
  0.0,
  2.7179999351501465,
  3.140000104904175,
  None],
 'test_int': [4, 2, 1, 2, 3, 5, 2, 3, 4, 3],
 'test_multi_str': [[0],
  None,
  [0, 1, 2],
  None,
  [0, 2],
  [1],
  None,
  [0],
  [0],
  [1]]}

Thanks.

Hi! You can use .map() to convert the None values to -1 after loading.

There is no parameter in load_dataset to set this default value.
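
For illustration, here is a minimal sketch of what such a .map() function could look like. The helper name fill_missing_labels and its column and missing parameters are made up for this example and are not part of the datasets library. The function only works on a plain batch dictionary (column name to list of values), so it can be tried without loading anything.

```python
def fill_missing_labels(batch, column="test_cat", missing=-1):
    """Replace None entries in one label column with `missing`.

    `batch` is a dict mapping column names to lists of values, which is
    the shape a batched .map() call passes to its function.
    Hypothetical helper: the name and parameters are assumptions.
    """
    return {column: [missing if value is None else value for value in batch[column]]}


# Quick check on the test_cat values shown in the question's output.
batch = {"test_cat": [1, 1, 2, None, 1, 2, None, 0, None, 2]}
filled = fill_missing_labels(batch)
print(filled["test_cat"])

# Assumed usage on the dataset from the question (not run here):
# ds2 = ds2.map(fill_missing_labels, batched=True)
```

Only the test_cat column is returned, so the other columns are left untouched by the map. It is worth checking on your own datasets version that the ClassLabel column keeps the -1 values after the map.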
