Hello everyone,
I am implementing a token classification model, following the example in the GitHub repo (transformers/run_ner.py at master · huggingface/transformers · GitHub). I have adapted it to my task, and I can train and test a model on data for which I have gold labels. Now I want to use the same model to predict labels for data without gold labels.
In the sample code, the tokenize_and_align_labels function assigns a label of -100 to special tokens and, unless the label_all_tokens flag is set, to every token of a word after the first.
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples[text_column_name],
        padding=padding,
        truncation=True,
        max_length=data_args.max_seq_length,
        # We use this argument because the texts in our dataset are lists of words (with a label for each word).
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples[label_column_name]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label_to_id[label[word_idx]])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label_to_id[label[word_idx]] if data_args.label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
Now, this works well if the data that’s being preprocessed has labels. But what should be done when these do not exist?
In order to get predictions, we could simply not have a labels key in tokenized_inputs. However, the information in tokenized_inputs["labels"] (i.e. which tokens have a -100 label) is used later to retrieve the predicted label per word, ignoring the predictions for special tokens and for all tokens of a word except the first (which is the correct behavior).
# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
What do you think would be the best way to handle this case? Maybe in tokenize_and_align_labels, the “true” tokens for which we want labels could have another value in tokenized_inputs["labels"]? Or maybe the post-processing to remove predictions on special tokens should not rely on the labels in the first place?
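To make the second option concrete, here is a minimal sketch of what I have in mind. It rebuilds the "keep this token" mask from the word ids alone, so no labels are needed. The helper names first_subtoken_mask and predictions_per_word are hypothetical (they are not part of run_ner.py), the toy values are made up, and the sketch always keeps only the first token of each word, i.e. it assumes label_all_tokens is off.

```python
from typing import List, Optional, Sequence


def first_subtoken_mask(word_ids: Sequence[Optional[int]]) -> List[bool]:
    """Hypothetical helper: True for the first token of each word,
    False for special tokens (word id None) and for continuation tokens."""
    mask = []
    previous_word_idx = None
    for word_idx in word_ids:
        # Same test as in tokenize_and_align_labels, but without touching labels.
        mask.append(word_idx is not None and word_idx != previous_word_idx)
        previous_word_idx = word_idx
    return mask


def predictions_per_word(
    pred_ids: Sequence[int],
    word_ids: Sequence[Optional[int]],
    label_list: Sequence[str],
) -> List[str]:
    """Hypothetical helper: keep one predicted label per word."""
    keep = first_subtoken_mask(word_ids)
    return [label_list[p] for p, k in zip(pred_ids, keep) if k]


# Toy example with made-up values, laid out like "[CLS] Hu ##gging Face [SEP] [PAD]".
label_list = ["O", "B-ORG", "I-ORG"]
word_ids = [None, 0, 0, 1, None, None]
pred_ids = [0, 1, 2, 2, 0, 0]  # e.g. the argmax over the logits for each token

word_labels = predictions_per_word(pred_ids, word_ids, label_list)
print(word_labels)  # ['B-ORG', 'I-ORG']
```

In the real script, word_ids would come from tokenized_inputs.word_ids(batch_index=i), as in tokenize_and_align_labels above. The same mask could presumably also serve the first option: writing a placeholder label (say 0) where the mask is True and -100 elsewhere would let the existing filtering code run unchanged, although any loss or metric computed from those placeholder labels would be meaningless.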
Any help you could provide would be welcome. Thanks!