Data Conversion to Conll2003

I downloaded an annotation.json file from the free spaCy NER annotator tool and am trying to convert the dataset to the conll2003 format. With the code below I can build the structure, but I'm not sure how to add the features so that the NER tags are stored as numbers. For example, running conll2003["train"].features["ner_tags"] on the Hugging Face dataset gives the output below, and I'm unable to get the same for my dataset. Additionally, kindly let me know how to create tokens the same way as in the Hugging Face conll2003 dataset.
Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

# Code to create the dataset structure
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# Split the training data into training and validation sets

train_data, validation_data = train_test_split(train_data, test_size=0.1, random_state=42)

train = Dataset.from_pandas(train_data)
test = Dataset.from_pandas(test_data)
val = Dataset.from_pandas(validation_data)

conll2003 = DatasetDict()
conll2003["train"] = train
conll2003["validation"] = val
conll2003["test"] = test
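
On the token part of the question, one possible approach is sketched below. It assumes each record in annotation.json looks like [text, {"entities": [[start, end, label], ...]}] with character offsets, which may not match your export exactly. The function name spans_to_bio and the sample record are hypothetical. It splits on whitespace only, whereas conll2003 also separates punctuation, so treat it as a starting point rather than an exact reproduction of that tokenization.

```python
import re


def spans_to_bio(text, entities):
    """Whitespace-tokenize `text` and tag each token from (start, end, label) character spans."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for ent_start, ent_end, label in entities:
            # A token belongs to an entity if the two character ranges overlap.
            if tok_start < ent_end and tok_end > ent_start:
                # The first token of an entity gets B-, the following ones I-.
                prefix = "B" if tok_start <= ent_start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Hypothetical record in the assumed annotator layout.
record = ["Paid 500 to HDFC Bank", {"entities": [[5, 8, "DEBIT_AMT"], [12, 21, "BANK_NAME"]]}]
tokens, tags = spans_to_bio(record[0], record[1]["entities"])
print(list(zip(tokens, tags)))
```

If you use B-/I- prefixes like this, the label names given to ClassLabel need the prefixed forms too (for example 'B-BANK_NAME' and 'I-BANK_NAME').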

Hi! You can run conll2003 = conll2003.cast_column("ner_tags", Sequence(ClassLabel(names=list_of_labels))) to cast the NER tags to Sequence(ClassLabel(...))

Hi Mario, thank you so much for the response. I updated the code as shown below, but I'm getting the error AttributeError: 'DataFrame' object has no attribute 'cast_column' and can't proceed. Would it be possible to have a quick call at your convenience? It would be a big help.

from datasets import ClassLabel, Sequence

list_of_labels = ['O', 'DEBIT_AMT', 'CREDIT_AMT', 'BALANCE_AMT', 'BANK_NAME', 'ACCOUNT_NO', 'CARD_NO']

df = df.cast_column("ner_tags", Sequence(ClassLabel(names=list_of_labels)))

cast_column should be called on a Dataset or DatasetDict object, not on a pandas DataFrame.

Thank you, it worked. Any idea how we can create the pos_tags and chunk_tags?