Changing ClassLabels for NER

Hey!

Thanks so much for your answer!

That's interesting. I actually figured out a different way. I had suspected something like a missing "config" file, since this kind of metadata can't be stored in a JSONL file. But when I did it the way shown in this tutorial right here:

    for split, dataset in drug_dataset_clean.items():
        dataset.to_json(f"drug-reviews-{split}.jsonl")

it didn't create an info file. However, if I write the entire DatasetDict to disk with the save_to_disk method, I get the Arrow format files together with the exact info file you are mentioning. But when I uploaded that to HF via the website, it couldn't read the data and just showed rubbish (I uploaded the entire folder that was created for the DatasetDict).

So what I eventually ended up doing was creating the ClassLabels like this:

from datasets import ClassLabel, DatasetDict, Sequence

ner_class_labels = ClassLabel(num_classes=3, names=["O", "B-DRUG", "I-DRUG"])

# train, test and validation are my datasets
train = train.cast_column("ner_tags", Sequence(ner_class_labels))
test = test.cast_column("ner_tags", Sequence(ner_class_labels))
validation = validation.cast_column("ner_tags", Sequence(ner_class_labels))

→ Sequence is the key word here! That is what I was missing the entire time.

dataset_dict = DatasetDict({"train": train, "validation": validation, "test": test})

and then pushing it to the Hub with dataset_dict.push_to_hub("myHub", token="mytoken").

This way, each int in the int list was cast directly to a ClassLabel carrying the meta information. And since I pushed it to the Hub directly as arrow files, it stored the meta information correctly in the ClassLabel objects without needing the info.json file.

But it's good to know it works with the info file as well. However, I was only able to create it with the DatasetDict.save_to_disk method. Can you create it even if you just save a single dataset to disk as JSONL?

But at any rate, thanks again!
