I have an example paragraph below:
Either party may terminate this Agreement by written notice at any time if the other party defaults in the performance of its material obligations hereunder. In the event of such default, the party declaring the default shall provide the defaulting party with written notice setting forth the nature of the default, and the defaulting party shall have thirty (30) days to cure the default. If after such 30 day period the default remains uncured, the aggrieved party may terminate this Agreement by written notice to the defaulting party, which notice shall be effective upon receipt.
and from it I need to extract the entity label and entity value:
Entity value = thirty (30) days
Entity label = Termination Notice Period
I want to frame this as an entity recognition task. Could you please tell me how you would approach it?
Named-entity recognition (NER) is typically solved as a sequence tagging task, i.e. the model is trained to predict a label for every word. Typically one annotates NER datasets using the IOB annotation format (or one of its variants, like BIOES). Let’s take the example sentence from your paragraph. It would have to be annotated as follows:
the O
defaulting O
party O
shall O
have O
thirty B-TER
(30) I-TER
days I-TER
to O
cure O
the O
default O
. O
In other words, we annotate each word as being either outside a named entity (“O”), inside a named entity (“I-TER”), or at the beginning of a named entity (“B-TER”).
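The annotation above can be produced programmatically. Here is a minimal sketch (plain Python, no libraries, hypothetical word list) of turning a known set of entity words into word-level IOB tags, using the “TER” label from the example:

```python
# Words of the example sentence and the words belonging to the entity.
words = ["the", "defaulting", "party", "shall", "have",
         "thirty", "(30)", "days", "to", "cure", "the", "default", "."]
entity_words = {"thirty", "(30)", "days"}

tags = []
inside = False  # are we currently inside the entity span?
for word in words:
    if word in entity_words:
        # First entity word gets B-TER, the following ones I-TER.
        tags.append("I-TER" if inside else "B-TER")
        inside = True
    else:
        tags.append("O")
        inside = False

for word, tag in zip(words, tags):
    print(word, tag)
```

Real annotation tools work on character offsets rather than a word set, but the resulting IOB sequence is the same.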
However, there’s one additional challenge: models like BERT operate on subword tokens rather than words, meaning that a word like “hello” might be tokenized into [“hel”, “lo”]. This means that one should actually label all tokens rather than all words, as BERT will be trained to predict a label for every token. There are multiple strategies here: one could either propagate the label to all subword tokens of a word, or only label the first subword token of a given word.
You can take a look at my example notebooks that illustrate how to fine-tune BERT for NER.
Suppose that I would like to label “Niels” as person, and that the original IOB annotation looked as follows:
Niels B-PER
When we tokenize “Niels” using BertTokenizer, we get:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "Niels"
input_ids = tokenizer(text).input_ids
for token_id in input_ids:
    print(token_id, tokenizer.decode([token_id]))
This prints:
101 [CLS]
9152 ni
9050 ##els
102 [SEP]
As you can see, the word “Niels” has been tokenized into 2 tokens, namely “ni” and “##els”. The [CLS] and [SEP] tokens are special tokens which BERT uses by default - let’s ignore those for now. Suppose that the label index for B-PER is 1.
So now you have a choice: either you label both “ni” and “##els” with label index 1, or you only label the first subword token “ni” with 1 and the second one with -100. The latter ensures that no loss is computed for the second subword token.
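Both strategies can be sketched with a small helper. This assumes the tokenizer exposes a mapping from each subword token to its source word index (as HuggingFace fast tokenizers do via `BatchEncoding.word_ids()`); here the mapping for the “Niels” example is hard-coded so the sketch is self-contained:

```python
# Label index 1 = B-PER; -100 is ignored by PyTorch's cross-entropy loss.
word_labels = [1]                 # one word: "Niels" -> B-PER
word_ids = [None, 0, 0, None]     # [CLS], "ni", "##els", [SEP]

def align_labels(word_ids, word_labels, label_all_subwords=False):
    labels = []
    previous = None
    for wid in word_ids:
        if wid is None:                     # special token ([CLS]/[SEP])
            labels.append(-100)
        elif wid != previous:               # first subword token of a word
            labels.append(word_labels[wid])
        else:                               # subsequent subword token
            labels.append(word_labels[wid] if label_all_subwords else -100)
        previous = wid
    return labels

print(align_labels(word_ids, word_labels))                           # [-100, 1, -100, -100]
print(align_labels(word_ids, word_labels, label_all_subwords=True))  # [-100, 1, 1, -100]
```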
The only change I made was removing .to_device, because it was raising this error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_index_select)
Thanks, it worked! I trained the model and saved it, along with the tokenizer. I am completely new to Hugging Face — how do I now load the model and make predictions? @Emanuel
A quick way to make predictions with your model / tokenizer is with the pipeline() function, e.g.
from transformers import pipeline
# Note: the model and tokenizer directories are usually the same
ner_tagger = pipeline("ner", model="path/to/your/model/dir", tokenizer="path/to/your/tokenizer/dir")
text = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""
entities = ner_tagger(text)
from transformers import pipeline
# Note: the model and tokenizer directories are usually the same
ner_tagger = pipeline("ner", model="E:\model\config.json", tokenizer="E:\model\vocab.txt")
text = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""
entities = ner_tagger(text)
ValueError: Could not load model E:\model\config.json with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForTokenClassification'>, <class 'transformers.models.auto.modeling_tf_auto.TFAutoModelForTokenClassification'>, <class 'transformers.models.bert.modeling_bert.BertForTokenClassification'>, <class 'transformers.models.bert.modeling_tf_bert.TFBertForTokenClassification'>).
Hey @ayush488, the model and tokenizer arguments should point to the directory where you saved the model / tokenizer with the save_pretrained() method. In other words, does the following work?
from transformers import pipeline
# Note: the model and tokenizer directories are usually the same
ner_tagger = pipeline("ner", model="E:\model", tokenizer="E:\model")
text = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""
entities = ner_tagger(text)
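The reason the earlier call failed is that pipeline() expects the path of the directory, not of an individual file like config.json. A quick way to check a directory before passing it to pipeline() is to verify it contains the files save_pretrained() typically writes. A minimal sketch (stdlib only; the file names below are the usual ones for a PyTorch BERT checkpoint, but exact contents can vary by transformers version — here we simulate the directory rather than load a real model):

```python
import pathlib
import tempfile

# Files typically written by model.save_pretrained() / tokenizer.save_pretrained().
expected = {"config.json", "vocab.txt", "pytorch_model.bin",
            "tokenizer_config.json", "special_tokens_map.json"}

def missing_model_files(directory):
    """Return which of the expected files are absent from `directory`."""
    present = {p.name for p in pathlib.Path(directory).iterdir()}
    return expected - present

# Simulate a save_pretrained() output directory with empty placeholder files:
with tempfile.TemporaryDirectory() as d:
    for name in expected:
        (pathlib.Path(d) / name).touch()
    missing = missing_model_files(d)
    print(missing)  # set() -> everything pipeline() needs is present
```

If this reports missing files for your real E:\model directory, they may be sitting in a subdirectory, which matches the error below.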
Hmm, the error suggests that the pipeline is looking for a nested directory like model\model. Do you have all the model files in a subdirectory?