Loading Custom Datasets

I am trying to load a custom dataset locally. It is a test dataset that will be revised soon and will probably never be public, so we would not want to put it on the HF Hub.
The dataset is in the same format as conll2003. The idea is to train BERT on conll2003 plus the custom dataset.

The setup I am testing (I am open to changes) is to use a folder under the project folder called "ADPConll" with all the data files in it (just like the conll2003 folder in the git datasets repo), like so:

MainProjectFolder
    ADPConll
        ADPConll.py          ← copy of conll2003.py with minor changes (see below)
        ADPConll.py.lock     ← HF internal file
        ADPConll.json        ← train dataset, in the format you see in the HF data browser
        ADPConll_test.json   ← test dataset
        ADPConll_valid.json  ← validation dataset

Now the question is how to set up the details. I think the key is in the ADPConll.py script, where I have the following:
_URL = "./ADPConll/"
_TRAINING_FILE = "ADPConll.json"
_DEV_FILE = "ADPConll_valid.txt"
_TEST_FILE = "ADPConll_test.txt"

I have tried several different settings, but I am trying to cut the trial-and-error approach short and see if we can get this documented.
Note that the format of the files matches what you see in the HF Streamlit data explorer.
Ideally we would train on a combination of the two datasets, but we would add new labels to our custom dataset so it can recognize additional entities.

Hi! The conll2003 script doesn't parse JSON files, but rather text files in the CoNLL format: one line per token, with space-separated columns for the token, its POS tag, its chunk tag and its NER tag.
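
For reference, a few token lines in that format look roughly like this (one token per line, with a blank line between sentences):

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O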

So if you want to use a dataset script like conll2003's, you'd have to use the right format, or reimplement the _generate_examples method to parse the JSON data (this should be straightforward, since you just have to iterate over the JSON data and yield each example).
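
A rough, untested sketch of what that could look like, assuming your JSON files contain one object per line with the same fields as conll2003 (id, tokens, pos_tags, chunk_tags, ner_tags):

import json

# Hypothetical method of the builder class in ADPConll.py, replacing the
# CoNLL text parsing with line-by-line JSON parsing.
def _generate_examples(self, filepath):
    with open(filepath, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            record = json.loads(line)
            yield idx, {
                "id": record["id"],
                "tokens": record["tokens"],
                "pos_tags": record["pos_tags"],
                "chunk_tags": record["chunk_tags"],
                "ner_tags": record["ner_tags"],
            }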

Otherwise if you don’t want to implement a dataset script, it’s probably simpler to directly use the json loader:

from datasets import load_dataset

data_files = {
    "train": “./ADPConll/ADPConll.json”,
    "validation": “./ADPConll/ADPConll_valid.json”,
    "test": “./ADPConll/ADPConll_test.json”
}

dataset = load_dataset("json", data_files=data_files)

Also if you want you can define your own features (with your own labels):

import datasets

features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "tokens": datasets.Sequence(datasets.Value("string")),
        "pos_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    '"',
                    "''",
                    "#",
                    "$",
                    "(",
                    ")",
                    ",",
                    ".",
                    ":",
                    "``",
                    "CC",
                    "CD",
                    "DT",
                    "EX",
                    "FW",
                    "IN",
                    "JJ",
                    "JJR",
                    "JJS",
                    "LS",
                    "MD",
                    "NN",
                    "NNP",
                    "NNPS",
                    "NNS",
                    "NN|SYM",
                    "PDT",
                    "POS",
                    "PRP",
                    "PRP$",
                    "RB",
                    "RBR",
                    "RBS",
                    "RP",
                    "SYM",
                    "TO",
                    "UH",
                    "VB",
                    "VBD",
                    "VBG",
                    "VBN",
                    "VBP",
                    "VBZ",
                    "WDT",
                    "WP",
                    "WP$",
                    "WRB",
                ]
            )
        ),
        "chunk_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    "O",
                    "B-ADJP",
                    "I-ADJP",
                    "B-ADVP",
                    "I-ADVP",
                    "B-CONJP",
                    "I-CONJP",
                    "B-INTJ",
                    "I-INTJ",
                    "B-LST",
                    "I-LST",
                    "B-NP",
                    "I-NP",
                    "B-PP",
                    "I-PP",
                    "B-PRT",
                    "I-PRT",
                    "B-SBAR",
                    "I-SBAR",
                    "B-UCP",
                    "I-UCP",
                    "B-VP",
                    "I-VP",
                ]
            )
        ),
        "ner_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    "O",
                    "B-PER",
                    "I-PER",
                    "B-ORG",
                    "I-ORG",
                    "B-LOC",
                    "I-LOC",
                    "B-MISC",
                    "I-MISC",
                ]
            )
        ),
    }
)

and then load the dataset with your own features:

dataset = load_dataset("json", data_files=data_files, features=features)

Let me know if that helps!


Oh, I just noticed that the json loader's features= parameter doesn't do class label encoding, so it fails (see the issue here).
As a workaround you can do

dataset = load_dataset("json", data_files=data_files)
dataset = dataset.map(features.encode_example, features=features)
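
For context, encode_example converts the string labels (e.g. "B-ORG") into their integer class ids according to the ClassLabel names, which is the integer representation the model ultimately needs.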

Thanks Quentin, this has been very helpful.
I had to rename pos_tags, chunk_tags, and ner_tags to pos, chunk, and ner in the features, but other than that I got much further.
What I am working on now is a call to
trainer.train()

I am getting the error:

  0%|          | 0/3 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 80, in default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
RuntimeError: Could not infer dtype of pyarrow.lib.Field
python-BaseException

At the point of the error, k = {str} '_schema' and v = {Schema: 5} (see the attached image), so I suspect that I am not setting a parameter correctly. I think my issue is that chunk, ner, and pos are lists of int64 instead of lists of strings.

The code I have (as small as I think I can make it) is:

import datasets
from transformers import BertForTokenClassification
from transformers import Trainer, TrainingArguments

from datasets import load_dataset   
features = datasets.Features(
        {
            "id": datasets.Value("string"),
            "words": datasets.Sequence(datasets.Value("string")),
            "pos": datasets.Sequence(
                datasets.features.ClassLabel(
                    names=[
                        '"',
                        "''",
                        "#",
                        "$",
                        "(",
                        ")",
                        ",",
                        ".",
                        ":",
                        "``",
                        "CC",
                        "CD",
                        "DT",
                        "EX",
                        "FW",
                        "IN",
                        "JJ",
                        "JJR",
                        "JJS",
                        "LS",
                        "MD",
                        "NN",
                        "NNP",
                        "NNPS",
                        "NNS",
                        "NN|SYM",
                        "PDT",
                        "POS",
                        "PRP",
                        "PRP$",
                        "RB",
                        "RBR",
                        "RBS",
                        "RP",
                        "SYM",
                        "TO",
                        "UH",
                        "VB",
                        "VBD",
                        "VBG",
                        "VBN",
                        "VBP",
                        "VBZ",
                        "WDT",
                        "WP",
                        "WP$",
                        "WRB",
                    ]
                )
            ),
            "chunk": datasets.Sequence(
                datasets.features.ClassLabel(
                    names=[
                        "O",
                        "B-ADJP",
                        "I-ADJP",
                        "B-ADVP",
                        "I-ADVP",
                        "B-CONJP",
                        "I-CONJP",
                        "B-INTJ",
                        "I-INTJ",
                        "B-LST",
                        "I-LST",
                        "B-NP",
                        "I-NP",
                        "B-PP",
                        "I-PP",
                        "B-PRT",
                        "I-PRT",
                        "B-SBAR",
                        "I-SBAR",
                        "B-UCP",
                        "I-UCP",
                        "B-VP",
                        "I-VP",
                    ]
                )
            ),
            "ner": datasets.Sequence(
                datasets.features.ClassLabel(
                    names=[
                        "O",
                        "B-PER",
                        "I-PER",
                        "B-ORG",
                        "I-ORG",
                        "B-LOC",
                        "I-LOC",
                        "B-MISC",
                        "I-MISC",
                        "B-SSN",
                        "I-SSN",
                        "B-CITY",
                        "I-CITY",
                    ]
                )
            ),
        }
    )

def ADPLoadData():
    dataFiles = {
        "train": "./ADPConll/ADPConll_train.json",
        "validation": "./ADPConll/ADPConll_valid.json",
        "test": "./ADPConll/ADPConll_test.json"
    }
    # dataset = load_dataset('json', data_files='./ADPConll/ADPConll_train.json')
    dataset = load_dataset('json', data_files=dataFiles)
    return dataset


ADPDataset = ADPLoadData()
ADPDataset = ADPDataset.map(features.encode_example, features=features)
print(ADPDataset)

modelName = 'dbmdz/bert-large-cased-finetuned-conll03-english'
model = BertForTokenClassification.from_pretrained(modelName)

training_args = TrainingArguments(
    output_dir='./results',  # output directory
    num_train_epochs=3,  # total number of training epochs
    per_device_train_batch_size=16,  # 1440 batch size per device during training
    per_device_eval_batch_size=16,  # 64,   # batch size for evaluation
    warmup_steps=500,  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,  # strength of weight decay
    logging_dir='./logs',  # directory for storing logs
    logging_steps=10,
)


trainer = Trainer(
    model=model,  # the instantiated 🤗 Transformers model to be trained
    args=training_args,  # training arguments, defined above
    train_dataset=[ADPDataset.data['train']],  # ,  # training dataset
    eval_dataset=[ADPDataset.data['validation']],  # evaluation dataset
)

trainer.train()
trainer.save_model('./newModel')

trainer.evaluate()

@lhoestq can you help me with this? The data seems to load but I can’t train on it.
Here are the 2 lines of the data file I am using:

{"id": "0", "chunk_tags": [ "B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "I-VP", "B-NP", "I-NP", "O" ], "ner_tags": [ "B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O" ], "pos_tags": [ "NNP", "VBZ", "JJ", "NN", "TO", "VB", "JJ", "NN", "." ], "tokens": [ "EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "." ]}
{"id": "0", "chunk_tags": [ "B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "I-VP", "B-NP", "I-NP", "O" ], "ner_tags": [ "B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O" ], "pos_tags": [ "NNP", "VBZ", "JJ", "NN", "TO", "VB", "JJ", "NN", "." ], "tokens": [ "EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "." ]}

And here is the code:

import json

import datasets
from transformers import BertForTokenClassification
from transformers import Trainer, TrainingArguments

from datasets import load_dataset

features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "tokens": datasets.Sequence(datasets.Value("string")),
        "pos_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    '"',
                    "''",
                    "#",
                    "$",
                    "(",
                    ")",
                    ",",
                    ".",
                    ":",
                    "``",
                    "CC",
                    "CD",
                    "DT",
                    "EX",
                    "FW",
                    "IN",
                    "JJ",
                    "JJR",
                    "JJS",
                    "LS",
                    "MD",
                    "NN",
                    "NNP",
                    "NNPS",
                    "NNS",
                    "NN|SYM",
                    "PDT",
                    "POS",
                    "PRP",
                    "PRP$",
                    "RB",
                    "RBR",
                    "RBS",
                    "RP",
                    "SYM",
                    "TO",
                    "UH",
                    "VB",
                    "VBD",
                    "VBG",
                    "VBN",
                    "VBP",
                    "VBZ",
                    "WDT",
                    "WP",
                    "WP$",
                    "WRB"
                ]
            )
        ),
        "chunk_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    "O",
                    "B-ADJP",
                    "I-ADJP",
                    "B-ADVP",
                    "I-ADVP",
                    "B-CONJP",
                    "I-CONJP",
                    "B-INTJ",
                    "I-INTJ",
                    "B-LST",
                    "I-LST",
                    "B-NP",
                    "I-NP",
                    "B-PP",
                    "I-PP",
                    "B-PRT",
                    "I-PRT",
                    "B-SBAR",
                    "I-SBAR",
                    "B-UCP",
                    "I-UCP",
                    "B-VP",
                    "I-VP"
                ]
            )
        ),
        "ner_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    "O",
                    "B-PER",
                    "I-PER",
                    "B-ORG",
                    "I-ORG",
                    "B-LOC",
                    "I-LOC",
                    "B-MISC",
                    "I-MISC",
                    # "B-SSN",
                    # "I-SSN",
                    # "B-CITY",
                    # "I-CITY"
                ]
            )
        ),
    }
)

dataFiles = {
    "train": "./ADPConll/ADPConll_train.json",
    "validation": "./ADPConll/ADPConll_valid.json",
    "test": "./ADPConll/ADPConll_test.json"
}
dataset = load_dataset('json', data_files=dataFiles, split='train')

# Apply the features object's encode_example() method to every example.
dataset = dataset.map(features.encode_example, features=features)
TrainDF = dataset.to_pandas()

modelName = 'bert-base-cased'
# modelName = 'dbmdz/bert-large-cased-finetuned-conll03-english'
model = BertForTokenClassification.from_pretrained(modelName)

training_args = TrainingArguments(
    output_dir='./results',  # output directory
    num_train_epochs=3,  # total number of training epochs
    per_device_train_batch_size=16,  # 1440 batch size per device during training
    per_device_eval_batch_size=16,  # 64,   # batch size for evaluation
    warmup_steps=500,  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,  # strength of weight decay
    logging_dir='./logs',  # directory for storing logs
    logging_steps=10,
)


trainer = Trainer(
    model=model,  # the instantiated 🤗 Transformers model to be trained
    args=training_args,  # training arguments, defined above
    train_dataset=dataset,
    eval_dataset=dataset
)

trainer.train()

The output I get is:

Using custom data configuration default-d850b3a6520ef12b
Reusing dataset json (/Users/greggwcasey/.cache/huggingface/datasets/json/default-d850b3a6520ef12b/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02)
100%|██████████| 2/2 [00:00<00:00, 173.15ex/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']

- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/3 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1345, in __getitem__
    return self._getitem(
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1337, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 365, in query_table
    _check_valid_index_key(key, size)
  File "/Users/greggwcasey/Google Drive/PycharmProjectsLocal/ADP_Project_NER/venv/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 308, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 0 is out of bounds for size 0
python-BaseException

What is the length of your dataset? len(dataset["train"])

Can you also try to do dataset["train"][0]?

@lhoestq, since I am doing split='train':

dataFiles = {
    "train": "./ADPConll/ADPConll_train.json",
    "validation": "./ADPConll/ADPConll_valid.json",
    "test": "./ADPConll/ADPConll_test.json"
}
dataset = load_dataset('json', data_files=dataFiles, split='train')

So I ran the following:

len(dataset)   which is {int} 2
dataset[0] which is {dict: 5}
{'id': '0', 
 'chunk_tags': ['B-NP', 'B-VP', 'B-NP', 'I-NP', 'B-VP', 'I-VP', 'B-NP', 'I-NP', 'O'],
 'ner_tags': ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'], 
 'pos_tags': ['NNP', 'VBZ', 'JJ', 'NN', 'TO', 'VB', 'JJ', 'NN', '.'], 
 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']}

The IndexError you get during training means that the Trainer tries to query the first element of an empty dataset. However, it looks like your dataset is not empty (2 elements) and that you are able to query the first element manually.

So it means that something makes the dataset empty at some point before it is passed to the Trainer. Can you also check the length of the dataset after applying this?

dataset = dataset.map(features.encode_example, features=features)
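
For example, assuming the same variable names as in your script:

print(len(dataset))   # should still report 2 examples after the map
print(dataset[0])     # the tag columns should now contain integers
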
EDIT: according to Kunal, this might be because `tokenize_and_align_labels` (from run_ner.py) needs to be applied to the dataset first. Indeed, if the dataset doesn't contain the right input, the Trainer could see it as empty.
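
For reference, here is a minimal sketch of that preprocessing step, adapted from the run_ner.py token classification example. It assumes a fast tokenizer and that the labels live in a ner_tags column already encoded as integers, so adjust the names to your dataset:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    # Tokenize the pre-split words and align the NER labels with the sub-word tokens.
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)           # special tokens are ignored by the loss
            elif word_id != previous_word_id:
                label_ids.append(tags[word_id])  # label only the first sub-token of each word
            else:
                label_ids.append(-100)           # ignore the remaining sub-tokens
            previous_word_id = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

dataset = dataset.map(tokenize_and_align_labels, batched=True)

After this map the dataset has input_ids, attention_mask and labels columns, which is the kind of input the default data collator and the Trainer expect.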