I have downloaded an annotation.json file from the spaCy free annotator NER tool, and I am trying to convert the dataset to the conll2003 format. With the code below I can build the dataset structure, but I'm not sure how to add the features that map the NER tags to numbers. For example, running conll2003["train"].features["ner_tags"] on the Hugging Face dataset gives the output shown below, and I'm unable to get the same for my dataset. Additionally, kindly let me know how to create the tokens in the same way as the Hugging Face conll2003 dataset.
Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
#Code to create the dataset structure
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
# Split the training data into training and validation sets
train_data, validation_data = train_test_split(train_data, test_size=0.1, random_state=42)
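# preserve_index=False keeps the shuffled pandas index from becoming an extra "__index_level_0__" column
train = Dataset.from_pandas(train_data, preserve_index=False)
test = Dataset.from_pandas(test_data, preserve_index=False)
val = Dataset.from_pandas(validation_data, preserve_index=False)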
conll2003 = DatasetDict()
conll2003['train'] = train
conll2003['validation'] = val
conll2003['test'] = test
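
For reference, here is a minimal sketch of one way to get both the tokens and the numeric ner_tags. It rests on assumptions that are not confirmed in this post: that each annotation is a text plus a list of (start, end, label) character spans, that the labels are PER/ORG/LOC/MISC as in conll2003, and that plain whitespace tokenization is acceptable (conll2003 itself splits punctuation into separate tokens, so this is only an approximation). The names spans_to_bio and to_hf_dataset are hypothetical helpers, not part of any library.

```python
import re

# Label order copied from the conll2003 ClassLabel shown in the question.
# Assumption: your own annotations use the same PER/ORG/LOC/MISC labels.
LABEL_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
LABEL2ID = {name: i for i, name in enumerate(LABEL_NAMES)}


def spans_to_bio(text, entities):
    """Whitespace-tokenize `text` and turn (start, end, label) character
    spans into integer BIO tag ids (hypothetical helper)."""
    tokens, tags = [], []
    prev_span = None  # index of the span the previous token belonged to
    for match in re.finditer(r"\S+", text):
        tag, current = "O", None
        for i, (start, end, label) in enumerate(entities):
            # A token is assigned to the first span it overlaps.
            if match.start() < end and match.end() > start:
                prefix = "I-" if i == prev_span else "B-"
                tag, current = prefix + label, i
                break
        tokens.append(match.group())
        tags.append(LABEL2ID[tag])
        prev_span = current
    return tokens, tags


def to_hf_dataset(records):
    """Build a Dataset whose ner_tags column is Sequence(ClassLabel).
    `records` is assumed to be an iterable of (text, entities) pairs."""
    # Imported lazily so spans_to_bio works without the library installed.
    from datasets import ClassLabel, Dataset, Features, Sequence, Value

    features = Features({
        "id": Value("string"),
        "tokens": Sequence(Value("string")),
        "ner_tags": Sequence(ClassLabel(names=LABEL_NAMES)),
    })
    rows = {"id": [], "tokens": [], "ner_tags": []}
    for i, (text, entities) in enumerate(records):
        tokens, tags = spans_to_bio(text, entities)
        rows["id"].append(str(i))
        rows["tokens"].append(tokens)
        rows["ner_tags"].append(tags)
    return Dataset.from_dict(rows, features=features)


# Made-up example sentence, only to show the shape of the output.
tokens, tags = spans_to_bio("Angela Merkel visited Paris",
                            [(0, 13, "PER"), (22, 27, "LOC")])
print(tokens)  # ['Angela', 'Merkel', 'visited', 'Paris']
print(tags)    # [1, 2, 0, 5]
```

If the DataFrame already has a ner_tags column holding lists of integer ids, it should also be possible to skip the rebuild and call cast_column("ner_tags", Sequence(ClassLabel(names=LABEL_NAMES))) on the existing DatasetDict, assuming the ids follow the same order as LABEL_NAMES.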