Asking for help with prediction results of a Named Entity Recognition task

Hi guys,

After training an NER model using the RoBERTa architecture, I got the result below:

{'eval_loss': 0.003242955543100834,
'eval_precision': 0.9959672534053343,
'eval_recall': 0.9959672534053343,
'eval_f1': 0.9959672534053343,
'eval_accuracy': 0.9995624335836689}

The result is generally quite high, as I expected. But here is my confusion: I then fed in a set of random sentences (outside the training set) to really check the model's performance.

My pseudo-code:

    from functools import partial
    from typing import Dict

    import numpy as np
    from datasets import Dataset

    def tokenize_and_align_labels_random(examples, tokenizer):
        # Tokenize pre-split words; no labels are built for these unseen sentences
        tokenized_inputs = tokenizer(examples['tokens'],
                                     truncation=True,
                                     is_split_into_words=True)
        return tokenized_inputs

    def preprocess_datasets(tokenizer, **datasets) -> Dict[str, Dataset]:
        tokenize_ner = partial(tokenize_and_align_labels_random,
                               tokenizer=tokenizer)
        return {k: ds.map(tokenize_ner) for k, ds in datasets.items()}

    address = Testing_Dataset[Testing_Dataset['address'] == 1]['text'].apply(clean_doc).tolist()

    da_datasets_random_Test = preprocess_datasets(
        tokenizer,
        test=Dataset.from_dict({'tokens': address}))

    results = da_trainer.predict(da_datasets_random_Test['test'])

    predictions = results.predictions
    predictions = np.argmax(predictions, axis=2)
    # Remove ignored index (special tokens); `labels` here would be the aligned label ids
    # (results.label_ids), which these out-of-training sentences do not actually have
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
I input sentences containing words that don't exist in the tokenizer vocabulary, and the tokenizer handles that for me by automatically splitting them into sub-tokens.

That means the 'input_ids' contain extra token ids to represent these cases, and the problem is that the number of predicted tags grows accordingly (depending on how many tokens were delivered to the model).

For instance

  ‱ Input sentence: "Giao tĂŽi lĂȘ_lai phường hai tĂąn_bĂŹnh hcm"
  ‱ Value after the tokenizer:
    {'input_ids': [0, 64003, 64003, 17489, 6115, 64139, 64151, 64003, 6446, 64313, 1340, 74780, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
  ‱ This is because "lĂȘ_lai" is tokenized as ['lĂȘ@@', 'l@@', 'ai'], "tĂąn_bĂŹnh" as ['tĂąn@@', 'bĂŹnh'], and "hcm" as ['h@@', 'cm']

The result I got after all: ['O', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'O', 'I-LOC', 'I-LOC', 'O']
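To make the mismatch concrete, here is a minimal check (reusing the sentence above and the same tokenizer object):

    # The number of predicted tags follows len(input_ids), not the number of words
    words = ["Giao", "tĂŽi", "lĂȘ_lai", "phường", "hai", "tĂąn_bĂŹnh", "hcm"]
    enc = tokenizer(words, is_split_into_words=True)
    print(len(words))             # 7 words, so 7 tags are expected
    print(len(enc["input_ids"]))  # 13 here: <s>, </s> and the extra sub-tokens inflate the count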

In fact, the prediction should have only 7 tags, one per input word, but there are more than that. So do you guys have any strategies for this? (One idea I have is to train the tokenizer with more tokens.)

I do appreciate your time and sharing.

Hello Iacle.
I suspect your issue is around WordPiece Tokenization but I can’t tell for sure with the info you posted.
Take a look at
Fine-tuning with custom datasets — transformers 4.5.0.dev0 documentation
In particular, pay attention to the part where it talks about WordPiece tokenization and the code in
def encode_tags(tags, encodings): 
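Roughly, that function aligns the word-level tags with the sub-token grid using the fast tokenizer's offset mappings. A sketch of the idea (assuming tag2id is your label-to-id dict and encodings was produced with return_offsets_mapping=True):

    import numpy as np

    def encode_tags(tags, encodings):
        labels = [[tag2id[tag] for tag in doc] for doc in tags]
        encoded_labels = []
        for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
            # start with every position ignored (-100)
            doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
            arr_offset = np.array(doc_offset)
            # only the first sub-token of each word (offset starting at 0, non-empty span) keeps a real label
            doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
            encoded_labels.append(doc_enc_labels.tolist())
        return encoded_labels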



Thank you so much @g3casey, I did not notice that the documentation already covers this.

However, another problem arises. I updated my tokenizer call with the new parameter return_offsets_mapping:

tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True, return_offsets_mapping=True)

[Screenshot: Screen Shot 2021-05-15 at 21.53.29]

And I am also confused: if return_offsets_mapping was not set to True during training, how could I still get such a high result? :sleepy:

It looks like you are not using the “fast” version of the tokenizer. Check to make sure.

https://huggingface.co/transformers/model_doc/roberta.html#robertatokenizerfast
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

tokenizer("Hello world")['input_ids']
[0, 31414, 232, 328, 2]
tokenizer(" Hello world")['input_ids']
[0, 20920, 232, 2]
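You can confirm which kind of tokenizer you have loaded with the is_fast attribute:

    from transformers import AutoTokenizer

    # works for any checkpoint; a slow tokenizer prints False here
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    print(tokenizer.is_fast)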

I get your point. Unfortunately, in the case of PhoBERT there is no fast version of the tokenizer.

Therefore, I manually wrote a function to do the thing mentioned in the documentation:

One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in :hugs: Transformers by setting the labels we wish to ignore to -100 . In the example above, if the label for @HuggingFace is 3 (indexing B-corporation ), we would set the labels of ['@', 'hugging', '##face'] to [3, -100, -100] .

    for i, label in tqdm(enumerate(examples["labels"]), total=len(examples["labels"])):
        steps = []   # token positions belonging to extra sub-tokens (to be ignored)
        batch = 0    # running count of extra sub-tokens seen so far
        for index, value in enumerate(examples['token'][i]):
            len_to_compare = len(tokenizer.tokenize(value))
            if len_to_compare > 1:
                # this word was split: record the positions of its extra sub-tokens
                steps += list(range(index + batch + 1, index + batch + len_to_compare))
                batch += len_to_compare - 1

With the function above I simply store the array of indexes that should be ignored; however, my result actually got worse.

[Screenshot: Screen Shot 2021-05-17 at 22.29.09]

I am not able to follow your version of encode_tags() and I have no way to test it so I can’t verify it. However, I don’t see anything setting a value to -100. Are you sure your loop is working correctly?


Here is my full code for creating the labels for training. I manually checked the generated labels and they were correct.

def tokenize_and_align_labels(examples, tokenizer):
    tokenized_inputs = tokenizer(examples["token"], truncation=True, is_split_into_words=True)
    label_to_id = dict(map(reversed, enumerate(label_list)))
    labels = []

    for i, label in tqdm(enumerate(examples["labels"]), total=len(examples["labels"])):
        # Collect the positions of the extra sub-tokens produced by split words
        steps = []
        batch = 0
        for index, value in enumerate(examples['token'][i]):
            len_to_compare = len(tokenizer.tokenize(value))
            if len_to_compare > 1:
                steps += list(range(index + batch + 1, index + batch + len_to_compare))
                batch += len_to_compare - 1
        # Build a word-id list matching the tokenized length; None marks positions to ignore
        word_ids = list(range(-1, len(label) + len(steps) + 1))
        word_ids[0] = None    # <s>
        word_ids[-1] = None   # </s>
        # https://huggingface.co/transformers/custom_datasets.html#token-classification-with-w-nut-emerging-entities
        for idx in steps:
            word_ids[idx + 1] = None  # set sub-tokens to None as well
        previous_word_idx = None
        label_ids = []
        index_to_look_up_label = 0
        for word_idx in word_ids:
            # Special tokens and sub-tokens have a word id of None. We set the label to -100
            # so they are automatically ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            else:
                label_ids.append(label_to_id[label[index_to_look_up_label]])
                index_to_look_up_label += 1

        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
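For completeness, this is roughly how I apply it with datasets.map (train_tokens and train_labels are just placeholder names for my word-level words and tags):

    from functools import partial
    from datasets import Dataset

    # placeholder names: one list of words and one list of tags per example
    raw_train = Dataset.from_dict({"token": train_tokens, "labels": train_labels})

    tokenize_ner = partial(tokenize_and_align_labels, tokenizer=tokenizer)
    train_dataset = raw_train.map(tokenize_ner, batched=True)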
  ‱ Input sentence: "thị_tráș„n an_phĂș huyện an phĂștỉnh an_giang"
  ‱ Value after the tokenizer:
    {'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'input_ids': [0, 135, 64051, 2677, 64049, 64008, 64020, 30519, 64998, 2677, 64046, 2],
    'labels': [-100, 2, 0, 1, -100, 1, 1, 1, -100, 1, -100, -100],
    'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
  ‱ This is because "an_phĂș" is tokenized as ['an_@@', 'phĂș'], "phĂștỉnh" as ['phĂș@@', 'tỉnh'], and "an_giang" as ['an_@@', 'giang']
  ‱ label_list = ["B-LOC", "I-LOC", "O"]

I want to make sure I understand your question. I translate your original text to:
“An Phu Town, An Minute District, An Giang Province”
(btw- why are there underscores in those places? If you are manually tying the words together, is that a common practice?)

So I guess you are expecting something like

*     thị_tráș„n an_phĂș huyện an phĂștỉnh an_giang
*     thị_tráș„n,  'an_@@',   'phĂș',    huyện,    an,       'phĂș@@',   'tỉnh',    'an_@@',   'giang'
*     "O",       "B-LOC",   "I-LOC",  "I-LOC",  "B-LOC",  "I-LOC",   "I-LOC",   "B-LOC",   "I-LOC"
*     "An Phu Town, An Minute District, An Giang Province"

Is that what you are expecting?

Actually, the input I wrote here was pre-processed with underthesea · PyPI. The original input is:
thị tráș„n An PhĂș, Huyện An phĂștỉnh An Giang

I do really appreciate your discussion.

According to Hugging Face's note,

@HuggingFace is 3 (indexing B-corporation ), we would set the labels of ['@', 'hugging', '##face'] to [3, -100, -100] .

So, I think the following tags would be more precise:

  ‱ thị_tráș„n, 'an_@@', 'phĂș', huyện, an, 'phĂș@@', 'tỉnh', 'an_@@', 'giang'
  ‱ "B-LOC", "I-LOC", None, "I-LOC", "I-LOC", "I-LOC", None, "I-LOC", None

In their documentation, they also state that the labels we wish to ignore are set to -100 (labelled as None above).

I think your expectations are a little off. You are expecting 1 long location, but in reality I think you should expect 3. If I go to the HF website and test your sentence (translated to English) I get 3 locations.
Go to this link:

You will see the English version of what you are testing with in the testing window. Then click "Compute".
You will see that it returns each region in your sentence as a separate location entity.
I am guessing that the tagging in PhoBERT is teaching RoBERTa to recognize the locations this way. So, if you want it to recognize the whole thing as one entity, I think you will have to create your own data tagged as you expect and train from the RoBERTa model just like the team that created PhoBERT did. Obviously, this would take a lot of work and compute time.


Thank you so, so much for all of your enthusiasm :grinning:.

In fact, the input "thị tráș„n An PhĂș, Huyện An phĂștỉnh An Giang" is the full sentence of an address.

Following your suggestions, it means I need to set the trainer to more epochs so the model can really learn that pattern.
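Something along these lines (a sketch; the values are placeholders, not my actual settings):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="phobert-ner",
        num_train_epochs=10,               # more passes over the training data
        per_device_train_batch_size=16,
        evaluation_strategy="epoch",
    )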

I also just found that we should:

# assign the 'id2label' and 'label2id' model configs
model.config.id2label = id2label
model.config.label2id = label2id
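The two dictionaries can be built straight from label_list, and (if I understand correctly) they can also be passed when loading the model:

    # build the mappings from the label list used above
    label_list = ["B-LOC", "I-LOC", "O"]
    id2label = {i: label for i, label in enumerate(label_list)}
    label2id = {label: i for i, label in enumerate(label_list)}

    # alternatively, supply them at load time, e.g.
    # AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(label_list),
    #                                                 id2label=id2label, label2id=label2id)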