I’d like to fine-tune BERT for NER as a token classification task. The primary reason for using NER is that it would encode the entire document at once, which should teach the model the INCOME > EXPENSE structure within the document. With vanilla BERT sequence classification, we look at each line item individually, potentially losing context from other relevant text in the document.
I have a pandas dataframe with some texts and labels.
import pandas as pd
import numpy as np
# Set a seed for reproducibility
np.random.seed(42)
# Generate dummy data
data = {
'Text': [
"Paid utility bills",
"Salary received from XYZ Corp",
"Dinner expenses at ABC Restaurant",
"Investment dividends",
"Rent payment"
],
'Label': ['Expense', 'Income', 'Expense', 'Income', 'Expense']
}
# Create a pandas DataFrame
df = pd.DataFrame(data)
How do I transform my dataset for a BERT NER task? I.e. how do I insert custom tags so that the data ends up in a format similar to other datasets used for NER, such as the one below.
I am expecting the data to be something like this:
Salary received from XYZ Corp [SEP] Investment dividends [SEP] Rent payment
Inc-B  Inc-I    Inc-I Inc-I Inc-I O    Inc-B      Inc-I     O    Exp-B Exp-I
Salary received from XYZ Corp [ORG]
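The transformation I have in mind could be sketched roughly like this (the `to_bio` helper and the `Inc`/`Exp` prefixes are my own naming, not from any standard dataset): each row's words get a `B-` tag on the first word and `I-` tags on the rest, and rows are joined with a `[SEP]` token labelled `O`.

```python
# Sketch: convert (Text, Label) rows into one word-level BIO-tagged sequence.
# "to_bio" and the Inc/Exp prefixes are hypothetical names for illustration.
label_prefix = {"Income": "Inc", "Expense": "Exp"}

def to_bio(text, label):
    """Tag the first word B-<prefix> and the remaining words I-<prefix>."""
    words = text.split()
    prefix = label_prefix[label]
    tags = [f"{prefix}-B"] + [f"{prefix}-I"] * (len(words) - 1)
    return words, tags

rows = [
    ("Salary received from XYZ Corp", "Income"),
    ("Investment dividends", "Income"),
    ("Rent payment", "Expense"),
]

tokens, tags = [], []
for i, (text, label) in enumerate(rows):
    w, t = to_bio(text, label)
    tokens += w
    tags += t
    if i < len(rows) - 1:          # join rows with [SEP], tagged O
        tokens.append("[SEP]")
        tags.append("O")

print(list(zip(tokens, tags)))
```

This produces one `(token, tag)` pair per word, with the `[SEP]` separators carrying the `O` tag, matching the layout shown above.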
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("EvanD/dutch-ner-xlm-conll2003")
ner_model = AutoModelForTokenClassification.from_pretrained("EvanD/dutch-ner-xlm-conll2003")
nlp = pipeline("ner", model=ner_model, tokenizer=tokenizer, aggregation_strategy="simple")
example = "George Washington ging naar Washington"
ner_results = nlp(example)
print(ner_results)
# [
#   {"entity_group": "PER", "score": 0.9999986886978149,
#    "word": "George Washington", "start": 0, "end": 17},
#   {"entity_group": "LOC", "score": 0.9999939203262329,
#    "word": "Washington", "start": 28, "end": 38}
# ]
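Once the word-level tags exist, they still have to be aligned with BERT's word-piece tokens before training. A minimal sketch of that alignment, assuming the `word_ids()` sequence a Hugging Face fast tokenizer would return (the concrete `word_ids` list below is made up for illustration, with word 3 split into two pieces):

```python
# Sketch of word-piece label alignment. "word_ids" mimics what a fast
# tokenizer's encoding.word_ids() returns: None for special tokens
# ([CLS]/[SEP]), and the word index repeated for sub-word pieces.
word_labels = ["Inc-B", "Inc-I", "Inc-I", "Inc-I", "Inc-I"]  # one tag per word
label2id = {"O": 0, "Inc-B": 1, "Inc-I": 2, "Exp-B": 3, "Exp-I": 4}

word_ids = [None, 0, 1, 2, 3, 3, 4, None]  # hypothetical example sequence

aligned = []
prev = None
for wid in word_ids:
    if wid is None:
        aligned.append(-100)                         # ignored by the loss
    elif wid != prev:
        aligned.append(label2id[word_labels[wid]])   # first piece keeps the tag
    else:
        aligned.append(-100)                         # later pieces ignored
    prev = wid

print(aligned)  # [-100, 1, 2, 2, 2, -100, 2, -100]
```

The `-100` sentinel is what `AutoModelForTokenClassification` ignores in its cross-entropy loss, so special tokens and trailing sub-word pieces don't contribute to training.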