I’d like to fine-tune BERT for NER as a token classification task. The primary reason for using NER is that it would encode the entire document at once, which should teach the model the INCOME > EXPENSE structure within the document. With vanilla BERT sequence classification, we look at each line item individually, potentially losing context from other relevant text in the document.
I have a pandas dataframe with some texts and labels.
import pandas as pd
import numpy as np
# Set a seed for reproducibility
np.random.seed(42)
# Generate dummy data
data = {
'Text': [
"Paid utility bills",
"Salary received from XYZ Corp",
"Dinner expenses at ABC Restaurant",
"Investment dividends",
"Rent payment"
],
'Label': ['Expense', 'Income', 'Expense', 'Income', 'Expense']
}
# Create a pandas DataFrame
df = pd.DataFrame(data)
How do I transform my dataset for a BERT NER task? I.e. how do I insert custom tags so that the data ends up in a format similar to other datasets used for NER, such as the one below.
I am expecting the data to be something like this:
Salary received from XYZ Corp [SEP] Investment dividends [SEP] Rent payment
Inc-B  Inc-I    Inc-I Inc-I Inc-I O    Inc-B      Inc-I     O    Exp-B Exp-I
Salary received from XYZ Corp [ORG]
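The transformation I have in mind could be sketched roughly like this (the `to_bio` helper and the `Inc`/`Exp` prefixes are my own naming, not from any standard dataset): each row's words get a `B-` tag on the first word and `I-` tags on the rest, and rows are joined with a `[SEP]` token labelled `O`.

```python
# Sketch: convert (Text, Label) rows into one word-level BIO-tagged sequence.
# "to_bio" and the Inc/Exp prefixes are hypothetical names for illustration.
label_prefix = {"Income": "Inc", "Expense": "Exp"}

def to_bio(text, label):
    """Tag the first word B-<prefix> and the remaining words I-<prefix>."""
    words = text.split()
    prefix = label_prefix[label]
    tags = [f"{prefix}-B"] + [f"{prefix}-I"] * (len(words) - 1)
    return words, tags

rows = [
    ("Salary received from XYZ Corp", "Income"),
    ("Investment dividends", "Income"),
    ("Rent payment", "Expense"),
]

tokens, tags = [], []
for i, (text, label) in enumerate(rows):
    w, t = to_bio(text, label)
    tokens += w
    tags += t
    if i < len(rows) - 1:          # join rows with [SEP], tagged O
        tokens.append("[SEP]")
        tags.append("O")

print(list(zip(tokens, tags)))
```

This produces one `(token, tag)` pair per word, with the `[SEP]` separators carrying the `O` tag, matching the layout shown above.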
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("EvanD/dutch-ner-xlm-conll2003")
ner_model = AutoModelForTokenClassification.from_pretrained("EvanD/dutch-ner-xlm-conll2003")
nlp = pipeline("ner", model=ner_model, tokenizer=tokenizer, aggregation_strategy="simple")
example = "George Washington ging naar Washington"
ner_results = nlp(example)
print(ner_results)
# [
#   {"entity_group": "PER", "score": 0.9999986886978149,
#    "word": "George Washington", "start": 0, "end": 17},
#   {"entity_group": "LOC", "score": 0.9999939203262329,
#    "word": "Washington", "start": 28, "end": 38}
# ]
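Once the word-level tags exist, they still have to be aligned with BERT's word-piece tokens before training. A minimal sketch of that alignment, assuming the `word_ids()` sequence a Hugging Face fast tokenizer would return (the concrete `word_ids` list below is made up for illustration, with word 3 split into two pieces):

```python
# Sketch of word-piece label alignment. "word_ids" mimics what a fast
# tokenizer's encoding.word_ids() returns: None for special tokens
# ([CLS]/[SEP]), and the word index repeated for sub-word pieces.
word_labels = ["Inc-B", "Inc-I", "Inc-I", "Inc-I", "Inc-I"]  # one tag per word
label2id = {"O": 0, "Inc-B": 1, "Inc-I": 2, "Exp-B": 3, "Exp-I": 4}

word_ids = [None, 0, 1, 2, 3, 3, 4, None]  # hypothetical example sequence

aligned = []
prev = None
for wid in word_ids:
    if wid is None:
        aligned.append(-100)                         # ignored by the loss
    elif wid != prev:
        aligned.append(label2id[word_labels[wid]])   # first piece keeps the tag
    else:
        aligned.append(-100)                         # later pieces ignored
    prev = wid

print(aligned)  # [-100, 1, 2, 2, 2, -100, 2, -100]
```

The `-100` sentinel is what `AutoModelForTokenClassification` ignores in its cross-entropy loss, so special tokens and trailing sub-word pieces don't contribute to training.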