ValueError: Field 'ner_tags' from the JSON data of type list<item: string> is not compatible with ClassLabel. Compatible types are int64 and string

I have no idea why I'm getting this error when I'm trying to load_dataset

classFeatures = Features({
    'ner_tags': ClassLabel(num_classes=3, names=['O', 'B-FAR', 'I-FAR'])
})
dataset = load_dataset("json",
                       data_files="data.jsonl",
                       use_auth_token=True,
                       features=classFeatures)

Thanks

Hi! I think this is because classFeatures does not declare that ner_tags is actually a sequence of class labels:

classFeatures = Features({
    'ner_tags': Sequence(ClassLabel(names=['O', 'B-FAR', 'I-FAR']))
})

Okay, I forgot about that :confused:

Now the error is different: ArrowInvalid: Failed to parse string: 'O' as a scalar of type int64

If I recall correctly, this issue has been fixed in a recent version of the library. Could you try updating datasets?

Yes :frowning:

import datasets
print(datasets.__version__)

1.18.3

Hi! If I'm not mistaken, this doesn't work for class labels nested inside a dict or a list. I think we will push the fix before the next release. In the meantime, load the dataset without specifying features, then use map to convert the tags to integers and set features to classFeatures.

@lhoestq WDYT about adding the cast_storage method to ClassLabel as well, to support str → int conversion?

It's not just casting (in the sense of manipulating arrays/buffers and dtypes), but a processing operation. Because of that, and to get good performance and reasonable memory usage, using map (or something similar) is probably best, especially for big datasets.
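The workaround suggested earlier in the thread (load without features, then map the string tags to integers while setting the features) might look roughly like the sketch below. The helper names encode_tags and load_with_class_labels are made up for illustration, and the sketch assumes data.jsonl has an ner_tags column holding lists of the three tag strings from the original post and that load_dataset puts the file under the default "train" split. It is not necessarily how the library maintainers would write it.

```python
# Label names taken from the original post; order defines the integer ids.
LABEL_NAMES = ["O", "B-FAR", "I-FAR"]
LABEL2ID = {name: idx for idx, name in enumerate(LABEL_NAMES)}


def encode_tags(batch):
    """Hypothetical helper: turn each list of string tags into integer ids."""
    return {
        "ner_tags": [[LABEL2ID[tag] for tag in tags] for tags in batch["ner_tags"]]
    }


def load_with_class_labels(path="data.jsonl"):
    """Hypothetical helper: load the JSON lines file, then convert ner_tags."""
    # Imported here so the pure-Python helper above works without the library.
    from datasets import ClassLabel, Features, Sequence, load_dataset

    # Step 1: load without `features`, so ner_tags stays a list of strings.
    raw = load_dataset("json", data_files=path, split="train")

    # Step 2: keep the inferred features, overriding only ner_tags.
    features = Features(
        {**raw.features, "ner_tags": Sequence(ClassLabel(names=LABEL_NAMES))}
    )

    # Step 3: convert strings to ids and attach the new features in one map.
    return raw.map(encode_tags, batched=True, features=features)
```

Calling load_with_class_labels() should then return a dataset whose ner_tags feature is a Sequence of ClassLabel. A tag string that is not in LABEL_NAMES raises a KeyError in encode_tags, which makes unexpected labels visible early.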

Hi, I'm trying to create a dataset whose ner_tags feature is of type ClassLabel, but casting is not possible when tags are nested inside a list as you said. Any idea on how to achieve this? Thanks xx