Hi guys,
After training an NER task using the RoBERTa architecture, I got the result below:
{'eval_loss': 0.003242955543100834,
'eval_precision': 0.9959672534053343,
'eval_recall': 0.9959672534053343,
'eval_f1': 0.9959672534053343,
'eval_accuracy': 0.9995624335836689}
The results are quite high overall, as I expected. But here is my confusion, which shows up when I feed the model a random set of sentences (from outside the training set) to really check its performance.
My pseudo code:
from functools import partial
from typing import Dict

import numpy as np
from datasets import Dataset


def tokenize_and_align_labels_random(examples, tokenizer):
    # The random sentences are already split into words, so only tokenize them
    tokenized_inputs = tokenizer(examples['tokens'],
                                 truncation=True,
                                 is_split_into_words=True)
    return tokenized_inputs


def preprocess_datasets(tokenizer, **datasets) -> Dict[str, Dataset]:
    tokenize_ner = partial(tokenize_and_align_labels_random, tokenizer=tokenizer)
    return {k: ds.map(tokenize_ner) for k, ds in datasets.items()}


# Keep only the rows marked as addresses, clean them, and wrap them in a Dataset
address = Testing_Dataset[Testing_Dataset['address'] == 1]['text'].apply(clean_doc).tolist()
da_datasets_random_Test = preprocess_datasets(tokenizer,
                                              test=Dataset.from_dict({'tokens': address}))

results = da_trainer.predict(da_datasets_random_Test['test'])
predictions = np.argmax(results.predictions, axis=2)  # (examples, seq_len, labels) -> tag ids

# Remove ignored index (special tokens); `labels` and `label_list` come from my training setup
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
I input sentences containing some words that don't exist in the tokenizer vocabulary, and the tokenizer handles that for me by automatically splitting them into sub-tokens.
That means the 'input_ids' will contain extra token ids to represent these cases; the problem is that the number of predicted tags grows with them (it depends on how many tokens were delivered to the model).
For instance:
- Input sentence: "Giao tôi lê_lai phường hai tân_bình hcm" (roughly: "Deliver to me at lê_lai, ward two, tân_bình, HCM")
- Value after the tokenizer:
{'input_ids': [0, 64003, 64003, 17489, 6115, 64139, 64151, 64003, 6446, 64313, 1340, 74780, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This is because "lê_lai" is tokenized as ["lê@@", "l@@", "ai"], "tân_bình" as ["tân@@", "bình"], and "hcm" as ["h@@", "cm"].
The result I got after all: ["O", "O", "B-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "O", "I-LOC", "I-LOC", "O"]
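To see exactly which word each of those positions belongs to, the sub-token split can be inspected directly. This is only a minimal sketch: `tokenizer` is the same object as in the pseudo code above, and word_ids() is only available if it is a fast tokenizer, which I am assuming here.

words = "Giao tôi lê_lai phường hai tân_bình hcm".split()
encoding = tokenizer(words, is_split_into_words=True)

# Show the sub-tokens the model actually receives
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# word_ids() maps each position back to the index of the word it came from
# (None for the special tokens), so one word can own several positions --
# which is exactly where the extra tags come from.
print(encoding.word_ids())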
In fact, the prediction should only have 7 tags, one for each input word, but now there are more than that. So do you have any strategies for this? (One I can think of is training the tokenizer with more tokens so that fewer words get split into sub-tokens.)
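Another direction I am wondering about is mapping the sub-token predictions back to word level by keeping only the tag predicted for the first sub-token of each word. Below is just a rough sketch of that idea, not a tested drop-in: it reuses `tokenizer`, `da_trainer` and `label_list` from the pseudo code above and again assumes a fast tokenizer so that word_ids() exists.

import numpy as np
from datasets import Dataset

def predict_word_tags(words):
    # Tokenize one pre-split sentence as a batch of size 1
    encoding = tokenizer([words], is_split_into_words=True, truncation=True)
    logits = da_trainer.predict(Dataset.from_dict(dict(encoding))).predictions[0]
    pred_ids = np.argmax(logits, axis=-1)

    tags, previous_word = [], None
    for position, word_idx in enumerate(encoding.word_ids(0)):
        # Skip special tokens and every sub-token after the first one of a word
        if word_idx is None or word_idx == previous_word:
            continue
        tags.append(label_list[pred_ids[position]])
        previous_word = word_idx
    return tags

print(predict_word_tags("Giao tôi lê_lai phường hai tân_bình hcm".split()))
# 7 words in, 7 tags out

Would something like that be the usual way to handle it, or is retraining the tokenizer the better option?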
I do appreciate your time and sharing.