I’m trying to use BERT for a custom token classification task. I annotated the data for this task with doccano; the exported dataset, in JSONL format, looks like this:
{"text": "EU rejects German call to boycott British lamb.", "label": [ [0, 2, "ORG"], [11, 17, "MIS"] ]}
{"text": "Peter Blackburn", "label": [ [0, 15, "PER"] ]}
{"text": "President Obama", "label": [ [10, 15, "PER"] ]}
I’ve managed to load this data, but I can’t feed it to BERT yet.
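For context, this is how I load the JSONL export (the file name is just a placeholder for my actual export):

import json

# Each line is one JSON object: {"text": ..., "label": [[start, end, tag], ...]}
with open("annotations.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]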
I followed the Hugging Face tutorial on token classification; basically, I’m trying to get a dataset that looks like the one in the tutorial:
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3453
    })
})
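To bridge the two formats, I sketched a conversion from character spans to per-token BIO tags. The whitespace tokenization and the label2id mapping below are my own assumptions, not something from the tutorial:

labels = ["O", "B-ORG", "I-ORG", "B-MIS", "I-MIS", "B-PER", "I-PER"]
label2id = {label: i for i, label in enumerate(labels)}

def spans_to_tokens(record, label2id):
    # Convert one doccano record with character-level spans into
    # whitespace tokens and one integer BIO tag per token.
    text, spans = record["text"], record["label"]
    tokens, tags = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # char offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for span_start, span_end, label in spans:
            if start >= span_start and end <= span_end:
                # B- for the first token of a span, I- for the rest
                tag = ("B-" if start == span_start else "I-") + label
                break
        tokens.append(token)
        tags.append(label2id[tag])
    return {"tokens": tokens, "ner_tags": tags}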
I’m trying to solve this by using pandas to transform the data before calling Dataset.from_pandas, but I can’t get it to work. Maybe there’s a better way to do this. Any help would be appreciated.
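Roughly, the rest of what I’m attempting looks like this (the split sizes and seed are arbitrary choices of mine):

import pandas as pd
from datasets import Dataset, DatasetDict

# One row per annotated sentence, using the conversion sketched above.
rows = [spans_to_tokens(record, label2id) for record in records]
df = pd.DataFrame(rows)
df["id"] = df.index.astype(str)

dataset = Dataset.from_pandas(df)

# Carve out validation and test splits to mirror the tutorial's structure.
splits = dataset.train_test_split(test_size=0.2, seed=42)
valid_test = splits["test"].train_test_split(test_size=0.5, seed=42)
dataset_dict = DatasetDict({
    "train": splits["train"],
    "validation": valid_test["train"],
    "test": valid_test["test"],
})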