Loading Custom Datasets

Hi! The conll2003 script doesn't parse JSON files, but rather text files in the CoNLL format. This format has one line per token, with space-separated labels for the POS tag, the chunk tag, and the NER tag.

So if you want to use a dataset script like conll2003, you'd have to use the right format, or reimplement the _generate_examples method to parse the JSON data (this should be straightforward, since you just have to iterate over the JSON data and yield each example).
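As a rough sketch, a _generate_examples override that reads JSON instead of CoNLL text could look like this (written here as a standalone generator; the field names are assumptions about your files):

```python
import json

def generate_examples(filepath):
    """Yield (key, example) pairs from a JSON file, mirroring _generate_examples.

    Assumes the file holds a JSON array of records with "tokens" and
    "ner_tags" fields; adapt the field names to your data.
    """
    with open(filepath, encoding="utf-8") as f:
        data = json.load(f)
    for idx, example in enumerate(data):
        yield idx, {
            "id": str(idx),
            "tokens": example["tokens"],
            "ner_tags": example["ner_tags"],
        }
```

Inside a dataset script this would be a method taking self, but the iteration logic is the same.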

Otherwise, if you don't want to implement a dataset script, it's probably simpler to use the json loader directly:

from datasets import load_dataset

data_files = {
    "train": "./ADPConll/ADPConll.json",
    "validation": "./ADPConll/ADPConll_valid.json",
    "test": "./ADPConll/ADPConll_test.json"
}

dataset = load_dataset("json", data_files=data_files)
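For the json loader to pick up the columns, each record should carry them as fields. Here is a minimal sketch of writing such a file with the standard library (the file name and field names are assumptions, chosen to mirror the conll2003 columns):

```python
import json

# One record per example; adapt the field names to your own data.
records = [
    {
        "id": "0",
        "tokens": ["EU", "rejects", "German", "call"],
        "pos_tags": ["NNP", "VBZ", "JJ", "NN"],
        "ner_tags": ["B-ORG", "O", "B-MISC", "O"],
    }
]

with open("sample.json", "w", encoding="utf-8") as f:
    json.dump(records, f)
```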

Also if you want you can define your own features (with your own labels):

import datasets

features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "tokens": datasets.Sequence(datasets.Value("string")),
        "pos_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    '"',
                    "''",
                    "#",
                    "$",
                    "(",
                    ")",
                    ",",
                    ".",
                    ":",
                    "``",
                    "CC",
                    "CD",
                    "DT",
                    "EX",
                    "FW",
                    "IN",
                    "JJ",
                    "JJR",
                    "JJS",
                    "LS",
                    "MD",
                    "NN",
                    "NNP",
                    "NNPS",
                    "NNS",
                    "NN|SYM",
                    "PDT",
                    "POS",
                    "PRP",
                    "PRP$",
                    "RB",
                    "RBR",
                    "RBS",
                    "RP",
                    "SYM",
                    "TO",
                    "UH",
                    "VB",
                    "VBD",
                    "VBG",
                    "VBN",
                    "VBP",
                    "VBZ",
                    "WDT",
                    "WP",
                    "WP$",
                    "WRB",
                ]
            )
        ),
        "chunk_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    "O",
                    "B-ADJP",
                    "I-ADJP",
                    "B-ADVP",
                    "I-ADVP",
                    "B-CONJP",
                    "I-CONJP",
                    "B-INTJ",
                    "I-INTJ",
                    "B-LST",
                    "I-LST",
                    "B-NP",
                    "I-NP",
                    "B-PP",
                    "I-PP",
                    "B-PRT",
                    "I-PRT",
                    "B-SBAR",
                    "I-SBAR",
                    "B-UCP",
                    "I-UCP",
                    "B-VP",
                    "I-VP",
                ]
            )
        ),
        "ner_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    "O",
                    "B-PER",
                    "I-PER",
                    "B-ORG",
                    "I-ORG",
                    "B-LOC",
                    "I-LOC",
                    "B-MISC",
                    "I-MISC",
                ]
            )
        ),
    }
)

and then load the dataset with your own features:

dataset = load_dataset("json", data_files=data_files, features=features)

Let me know if that helps!
