Data Conversion to Conll2003

sushantsur23 · September 30, 2023, 3:33am

I have downloaded an annotation.json file from the spaCy free annotator NER tool, and I am trying to convert the dataset to the conll2003 format. With the code below I can build the dataset structure, but I'm not sure how to add the features for the NER tags as numbers. For example, running conll2003["train"].features["ner_tags"] on the Hugging Face dataset gives the output below, and I'm unable to get the same for my dataset. Additionally, kindly let me know how to create tokens in the same way as the Hugging Face conll2003 dataset.
Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

#Code to create the dataset structure
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# Splitting the training data into training and validation sets

train_data, validation_data = train_test_split(train_data, test_size=0.1, random_state=42)

train = Dataset.from_pandas(train_data)
test = Dataset.from_pandas(test_data)
val = Dataset.from_pandas(validation_data)

conll2003 = DatasetDict()
conll2003['train'] = train
conll2003['validation'] = val
conll2003['test'] = test
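
For the token part of the question, here is a minimal sketch of one way to turn character-span annotations into conll2003-style tokens and tags. It is not from the thread. It assumes each annotation is a text plus a list of [start, end, label] entity spans (an assumption about the annotation.json layout), splits on whitespace, and uses B-/I- prefixes to mirror conll2003, which differs from the flat label names used later in this thread. The function name spans_to_bio and the sample sentence are hypothetical.

```python
import re


def spans_to_bio(text, entities):
    """Whitespace-tokenize `text` and give each token a BIO tag.

    `entities` is a list of [start, end, label] character spans
    (assumed layout of the annotation tool's export).
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for ent_start, ent_end, label in entities:
            # A token belongs to an entity if the two character ranges overlap.
            if tok_start < ent_end and tok_end > ent_start:
                # The first token of an entity gets B-, later tokens get I-.
                prefix = "B" if tok_start <= ent_start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Hypothetical example sentence and spans.
text = "Rs 500 debited from HDFC Bank account"
entities = [[3, 6, "DEBIT_AMT"], [20, 29, "BANK_NAME"]]
tokens, tags = spans_to_bio(text, entities)
print(list(zip(tokens, tags)))
```

Applying this to every annotation would give "tokens" and "ner_tags" columns for the DataFrame. Whitespace splitting is the simplest choice; punctuation attached to a word stays part of that token, so a real tokenizer may be a better fit for some data.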

mariosasko · October 3, 2023, 2:18pm

Hi! You can run conll2003 = conll2003.cast_column("ner_tags", Sequence(ClassLabel(names=list_of_labels))) to cast the NER tags to Sequence(ClassLabel(...))

sushantsur23 · October 19, 2023, 3:50pm

Hi Mario, thank you so much for the response. I updated the code as shown below, but I'm getting the error AttributeError: 'DataFrame' object has no attribute 'cast_column', and I'm unable to move ahead. Would it be possible to have a quick call at your convenience? It would be a big help.

from datasets import ClassLabel, Sequence

list_of_labels = ['O', 'DEBIT_AMT', 'CREDIT_AMT', 'BALANCE_AMT', 'BANK_NAME', 'ACCOUNT_NO', 'CARD_NO']

df = df.cast_column("ner_tags", Sequence(ClassLabel(names=list_of_labels)))

mariosasko · October 20, 2023, 2:19pm

cast_column should be called on a Dataset/DatasetDict object, not on a Pandas DataFrame.

sushantsur23 · December 28, 2023, 7:26am

Thank you, it worked. Any idea how we can create the pos_tags and chunk_tags?
