Loading Custom Datasets

Hi! The conll2003 script doesn't parse JSON files, but rather text files in the CoNLL format. This format has one line per token, with space-separated labels for the POS tag, the chunk tag, and the NER tag.

So if you want to use a dataset script like conll2003, you'd have to use the right format, or reimplement the _generate_examples method to parse the JSON data (this should be straightforward, since you just have to iterate over the JSON data and yield each example).
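As a rough sketch, a _generate_examples override that reads JSON instead of CoNLL text could look like this (written here as a standalone generator; the field names are assumptions about your files):

```python
import json

def generate_examples(filepath):
    """Yield (key, example) pairs from a JSON file, mirroring _generate_examples.

    Assumes the file holds a JSON array of records with "tokens" and
    "ner_tags" fields; adapt the field names to your data.
    """
    with open(filepath, encoding="utf-8") as f:
        data = json.load(f)
    for idx, example in enumerate(data):
        yield idx, {
            "id": str(idx),
            "tokens": example["tokens"],
            "ner_tags": example["ner_tags"],
        }
```

Inside a dataset script this would be a method taking self, but the iteration logic is the same.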

Otherwise, if you don't want to implement a dataset script, it's probably simpler to use the json loader directly:

from datasets import load_dataset

data_files = {
    "train": "./ADPConll/ADPConll.json",
    "validation": "./ADPConll/ADPConll_valid.json",
    "test": "./ADPConll/ADPConll_test.json"
}

dataset = load_dataset("json", data_files=data_files)
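For the json loader to pick up the columns, each record should carry them as fields. Here is a minimal sketch of writing such a file with the standard library (the file name and field names are assumptions, chosen to mirror the conll2003 columns):

```python
import json

# One record per example; adapt the field names to your own data.
records = [
    {
        "id": "0",
        "tokens": ["EU", "rejects", "German", "call"],
        "pos_tags": ["NNP", "VBZ", "JJ", "NN"],
        "ner_tags": ["B-ORG", "O", "B-MISC", "O"],
    }
]

with open("sample.json", "w", encoding="utf-8") as f:
    json.dump(records, f)
```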

Also if you want you can define your own features (with your own labels):

import datasets

features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "tokens": datasets.Sequence(datasets.Value("string")),
        "pos_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    '"',
                    "''",
                    "#",
                    "$",
                    "(",
                    ")",
                    ",",
                    ".",
                    ":",
                    "``",
                    "CC",
                    "CD",
                    "DT",
                    "EX",
                    "FW",
                    "IN",
                    "JJ",
                    "JJR",
                    "JJS",
                    "LS",
                    "MD",
                    "NN",
                    "NNP",
                    "NNPS",
                    "NNS",
                    "NN|SYM",
                    "PDT",
                    "POS",
                    "PRP",
                    "PRP$",
                    "RB",
                    "RBR",
                    "RBS",
                    "RP",
                    "SYM",
                    "TO",
                    "UH",
                    "VB",
                    "VBD",
                    "VBG",
                    "VBN",
                    "VBP",
                    "VBZ",
                    "WDT",
                    "WP",
                    "WP$",
                    "WRB",
                ]
            )
        ),
        "chunk_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    "O",
                    "B-ADJP",
                    "I-ADJP",
                    "B-ADVP",
                    "I-ADVP",
                    "B-CONJP",
                    "I-CONJP",
                    "B-INTJ",
                    "I-INTJ",
                    "B-LST",
                    "I-LST",
                    "B-NP",
                    "I-NP",
                    "B-PP",
                    "I-PP",
                    "B-PRT",
                    "I-PRT",
                    "B-SBAR",
                    "I-SBAR",
                    "B-UCP",
                    "I-UCP",
                    "B-VP",
                    "I-VP",
                ]
            )
        ),
        "ner_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    "O",
                    "B-PER",
                    "I-PER",
                    "B-ORG",
                    "I-ORG",
                    "B-LOC",
                    "I-LOC",
                    "B-MISC",
                    "I-MISC",
                ]
            )
        ),
    }
)

and then load the dataset with your own features:

dataset = load_dataset("json", data_files=data_files, features=features)

Let me know if that helps!
