I’m trying to use BERT for a custom token classification task. I annotated the data for this task with doccano; the exported dataset, in JSONL format, looks like this:
{"text": "EU rejects German call to boycott British lamb.", "label": [ [0, 2, "ORG"], [11, 17, "MIS"] ]}
{"text": "Peter Blackburn", "label": [ [0, 15, "PER"] ]}
{"text": "President Obama", "label": [ [10, 15, "PER"] ]}
I’ve managed to load this data, but I can’t feed it to BERT yet.
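For context, this is how I load the JSONL export (the file name is just a placeholder for my actual export):

import json

# Each line is one JSON object: {"text": ..., "label": [[start, end, tag], ...]}
with open("annotations.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]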
I followed the Hugging Face tutorial on token classification; basically, I’m trying to get a dataset that looks like the one in the tutorial:
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3453
    })
})
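To bridge the two formats, I sketched a conversion from character spans to per-token BIO tags. The whitespace tokenization and the label2id mapping below are my own assumptions, not something from the tutorial:

labels = ["O", "B-ORG", "I-ORG", "B-MIS", "I-MIS", "B-PER", "I-PER"]
label2id = {label: i for i, label in enumerate(labels)}

def spans_to_tokens(record, label2id):
    # Convert one doccano record with character-level spans into
    # whitespace tokens and one integer BIO tag per token.
    text, spans = record["text"], record["label"]
    tokens, tags = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # char offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for span_start, span_end, label in spans:
            if start >= span_start and end <= span_end:
                # B- for the first token of a span, I- for the rest
                tag = ("B-" if start == span_start else "I-") + label
                break
        tokens.append(token)
        tags.append(label2id[tag])
    return {"tokens": tokens, "ner_tags": tags}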
I’m trying to solve this by using pandas to transform the data before calling Dataset.from_pandas, but I can’t get it to work. Maybe there’s a better way to do this. Any help would be appreciated.
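Roughly, the rest of what I’m attempting looks like this (the split sizes and seed are arbitrary choices of mine):

import pandas as pd
from datasets import Dataset, DatasetDict

# One row per annotated sentence, using the conversion sketched above.
rows = [spans_to_tokens(record, label2id) for record in records]
df = pd.DataFrame(rows)
df["id"] = df.index.astype(str)

dataset = Dataset.from_pandas(df)

# Carve out validation and test splits to mirror the tutorial's structure.
splits = dataset.train_test_split(test_size=0.2, seed=42)
valid_test = splits["test"].train_test_split(test_size=0.5, seed=42)
dataset_dict = DatasetDict({
    "train": splits["train"],
    "validation": valid_test["train"],
    "test": valid_test["test"],
})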