Changing ClassLabels for NER

Hey there,

So I've been stuck the entire day and cannot find anything to help me. Maybe I am blind, but I am also completely new to this (ML and Python, that is).

Basically I have made a dataset that looks like this:

2ndBestKiller/DrugTest · Datasets at Hugging Face (this is exactly mine)

Now I wanted to follow an NER tutorial from the HF page, this one here:

For the most part, my dataset looks exactly like theirs, except for the ClassLabels in my ner_tags column. There should be references to the corresponding IOB entities, but alas there is nothing.

Here is what I am talking about:

ner_feature = raw_datasets["train"].features["ner_tags"]

When you print ner_feature, it should look like this:

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], names_file=None, id=None), length=-1, id=None)

But mine looks like this:

Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)

And I cannot figure out how to set this up. None of the tutorials or documentation seem to talk about this, but again, maybe I am just blind.

Can anyone help me? (And please make it easy to understand, since my knowledge in this area is very limited.)

How did you upload the dataset? It's missing the dataset_infos.json file, which contains the feature definitions etc.

If this file is available on your computer, upload that too; otherwise create it. Check out the file for the dataset used in that example. The feature information you are looking for is part of the dataset_infos.json file, which is missing in your repo.

Hey!

Thanks so much for your answer!

That's interesting. So I actually figured out a different way. The thing is, I suspected something like a missing "config" file, since this kind of metadata can't be stored in a JSONL file. But when I did it the way shown in this tutorial right here:

for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

it didn't create an info file. However, if I write the entire DatasetDict to disk with the save_to_disk method, I get the Arrow format files with this exact info file you are mentioning. However, when I uploaded that to HF via the website, it couldn't read the data and was just showing rubbish (I uploaded the entire folder that was created for the DatasetDict).

So what I eventually ended up doing was creating the ClassLabels like this:

from datasets import ClassLabel, DatasetDict, Sequence

ner_class_labels = ClassLabel(num_classes=3, names=["O", "B-DRUG", "I-DRUG"])

# train, test and validation are my datasets
train = train.cast_column("ner_tags", Sequence(ner_class_labels))
test = test.cast_column("ner_tags", Sequence(ner_class_labels))
validation = validation.cast_column("ner_tags", Sequence(ner_class_labels))
→ Sequence being the key word here! Because this is what I was missing the entire time.

dataset_dict = DatasetDict({"train": train, "validation": validation, "test": test})

and pushed it to the hub via the dataset_dict.push_to_hub("myHub", token="mytoken") method.

This way, each int in the int list was cast directly to a ClassLabel with the meta information. And since I pushed it to the hub directly as Arrow files, it stored the meta information correctly in the ClassLabel objects without needing the info.json file.

But it's good to know it works with the info file as well. However, I was only able to create it with the DatasetDict.save_to_disk method. Can you create it even if you just save a single dataset to disk as JSONL?

But at any rate, thanks again!

I have only used the save_to_disk or push_to_hub methods, so I haven't seen this issue before.

Based on what you discovered, I suspect the csv and json methods are for exporting to that format only and not meant for saving the dataset for later use. Thank you for sharing what you found.