Doccano dataset for named entity recognition task using BERT

I’m trying to use BERT for custom token classification. I have annotated data for this task using doccano, and the exported dataset in JSONL format looks like this:

{"text": "EU rejects German call to boycott British lamb.", "label": [ [0, 2, "ORG"], [11, 17, "MIS"] ]}
{"text": "Peter Blackburn", "label": [ [0, 15, "PER"] ]}
{"text": "President Obama", "label": [ [10, 15, "PER"] ]}

Following this, I managed to load the data, but I can’t feed it to BERT yet.

I followed the Hugging Face tutorial for token classification. Basically, I was trying to get a dataset that looks like the one in the tutorial:

    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3453
    })

I’m trying to solve this by using pandas to transform the data before calling Dataset.from_pandas, but I can’t get it to work. Maybe there is a better way to do this. Any help would be appreciated.


That’s already great! Citing myself here: NER model fine tuning with labeled spans - #3 by nielsr

I found the solution: the doccano/doccano-transformer package from the doccano team. There is a catch though: it has an annoying bug, solved by this, and since there is next to no documentation, we can use this to write the file.
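
In case it helps anyone who would rather skip the extra dependency, here is a minimal sketch of the same conversion in plain Python. It is not what doccano-transformer does internally. The helper names `spans_to_bio` and `doccano_to_records` and the simple regex tokenizer are my own assumptions, and it assumes doccano's `[start, end, label]` spans use an exclusive end offset, as in the sample above.

```python
import json
import re

# Sample lines copied from the doccano JSONL export shown in the question.
SAMPLE_LINES = [
    '{"text": "EU rejects German call to boycott British lamb.", "label": [[0, 2, "ORG"], [11, 17, "MIS"]]}',
    '{"text": "Peter Blackburn", "label": [[0, 15, "PER"]]}',
    '{"text": "President Obama", "label": [[10, 15, "PER"]]}',
]


def spans_to_bio(text, spans):
    """Split text into word/punctuation tokens and tag them in BIO format.

    Assumption: a token belongs to a span only if it lies fully inside it.
    """
    spans = sorted(spans, key=lambda s: s[0])
    tokens, tags = [], []
    previous = None  # index of the span the previous token belonged to
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tok_start, tok_end = match.span()
        current = None
        for i, (start, end, _label) in enumerate(spans):
            if tok_start >= start and tok_end <= end:
                current = i
                break
        if current is None:
            tag = "O"
        else:
            # First token of a span gets B-, following tokens get I-.
            prefix = "I-" if current == previous else "B-"
            tag = prefix + spans[current][2]
        tokens.append(match.group())
        tags.append(tag)
        previous = current
    return tokens, tags


def doccano_to_records(lines):
    """Turn doccano JSONL lines into dicts with 'id', 'tokens', 'ner_tags'."""
    records = []
    for idx, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        example = json.loads(line)
        tokens, tags = spans_to_bio(example["text"], example.get("label", []))
        records.append({"id": str(idx), "tokens": tokens, "ner_tags": tags})
    return records


records = doccano_to_records(SAMPLE_LINES)
for record in records:
    print(record)
```

The resulting list of dicts has the same three columns as the tutorial dataset, so it should be usable with `Dataset.from_list(records)` in recent versions of `datasets`, or via a pandas DataFrame and `Dataset.from_pandas` as in the question. Note that the tags here are strings, while the tutorial uses integer ids, so they would still need to be mapped to ids before training.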

