Predicting with Token Classifier on data with no gold labels

Hello everyone,

I am implementing a token classification model, following the example in the GitHub repo (transformers/ at master · huggingface/transformers · GitHub). I have adapted it to my task, and I can train and test a model on data for which I have gold labels. Now I want to use the same model to predict labels for data that has no gold labels.

In the sample code, the tokenize_and_align_labels function assigns a label of -100 to special tokens and, depending on the label_all_tokens flag, also to every token of a word except the first.

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples[text_column_name],
        padding=padding,
        truncation=True,
        # We use this argument because the texts in our dataset are lists of words (with a label for each word).
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples[label_column_name]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label_to_id[label[word_idx]])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label_to_id[label[word_idx]] if data_args.label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

Now, this works well if the data that’s being preprocessed has labels. But what should be done when these do not exist?

To get predictions, we could simply omit the labels key from tokenized_inputs. However, the information in tokenized_inputs["labels"] (i.e. which tokens have a -100 label) is used later to retrieve one predicted label per word: it filters out the predictions for special tokens and for every token of a word except the first (which is the correct behavior).

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

What do you think would be the best way to handle this case? Maybe in tokenize_and_align_labels, the “true” tokens for which we want labels could have another value in tokenized_inputs["labels"]? Or maybe the post-processing to remove predictions on special tokens should not rely on the labels in the first place?

Any help you could provide would be welcome. Thanks!


If you’re using a fast tokenizer (such as BertTokenizerFast), you can use the offsets to tell whether a token is a special token, the first wordpiece of a word, or a later wordpiece. Small example:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "hello my name is niels"

encoding = tokenizer(text, return_offsets_mapping=True)

# token_id is used instead of id to avoid shadowing the Python builtin.
for token_id, offset in zip(encoding.input_ids, encoding.offset_mapping):
    print(tokenizer.decode([token_id]), offset)

This returns:

[CLS] (0, 0)
hello (0, 5)
my (6, 8)
name (9, 13)
is (14, 16)
ni (17, 19)
##els (19, 22)
[SEP] (0, 0)

As you can see, the offsets for special tokens are (0, 0). If offset[0] of a token is equal to offset[1] of the previous token, the token starts exactly where the previous one ended, so we know it is a subword token that is not the first one of a word.
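The rule above can also be spelled out as a small helper. This is only a sketch: first_subword_indices is a hypothetical name, not part of transformers, and it works on a plain list of (start, end) offsets such as the ones printed above, so it needs no tokenizer to run. One caveat to keep in mind is that any token touching the previous one counts as a continuation, so punctuation attached to a word (e.g. a trailing period) would be skipped as well.

```python
from typing import List, Tuple


def first_subword_indices(offsets: List[Tuple[int, int]]) -> List[int]:
    """Return the positions of tokens that start a word (hypothetical helper).

    A token is kept when it is not a special token, i.e. its offset is not
    (0, 0), and it does not start where the previous real token ended.
    """
    keep = []
    prev_end = None  # end offset of the previous non-special token
    for idx, (start, end) in enumerate(offsets):
        if (start, end) == (0, 0):
            # Special token such as [CLS] or [SEP]: skip it and reset.
            prev_end = None
            continue
        if prev_end is None or start != prev_end:
            # First real token, or a gap before this token: a new word starts.
            keep.append(idx)
        prev_end = end
    return keep


# Offsets copied from the "hello my name is niels" example above.
offsets = [(0, 0), (0, 5), (6, 8), (9, 13), (14, 16), (17, 19), (19, 22), (0, 0)]
print(first_subword_indices(offsets))  # [1, 2, 3, 4, 5]
```

Because the previous end offset is reset after each special token, the first real token is kept without being special-cased.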

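You can write this as a (rather long) list comprehension to filter the predictions. Index 1 is added by hand because the first real token starts at offset 0, which equals the end offset of [CLS], so the comparison alone would skip it. Here predictions is assumed to hold the predicted label ids for this one sentence as a 1-D tensor: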

true_indices = [1] + [idx for idx, offset in enumerate(encoding.offset_mapping) if offset != (0, 0) and offset[0] != encoding.offset_mapping[idx-1][1]]
true_predictions = predictions.numpy()[true_indices]
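Another option, closer to the idea raised in the question, is to keep the existing post-processing and feed it placeholder labels built from word_ids() instead of from gold labels. The following is a sketch under assumptions: placeholder_labels and word_level_predictions are hypothetical helper names, the placeholder value 0 is arbitrary (anything other than -100 works), and the word_ids list is written out by hand to mirror what a fast tokenizer returns for the sentence above.

```python
from typing import List, Optional, Sequence


def placeholder_labels(word_ids: Sequence[Optional[int]], keep_value: int = 0) -> List[int]:
    """Build a labels-like mask when no gold labels exist (hypothetical helper).

    The first token of each word gets keep_value; special tokens and later
    subword tokens get -100, matching the convention of the training code.
    """
    labels = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            labels.append(-100)  # special token
        elif word_idx != previous_word_idx:
            labels.append(keep_value)  # first token of a word
        else:
            labels.append(-100)  # later token of the same word
        previous_word_idx = word_idx
    return labels


def word_level_predictions(prediction: Sequence[int], labels: Sequence[int],
                           label_list: Sequence[str]) -> List[str]:
    """Same filter as the original post-processing, for a single example."""
    return [label_list[p] for p, l in zip(prediction, labels) if l != -100]


# Hand-written word ids for: [CLS] hello my name is ni ##els [SEP]
word_ids = [None, 0, 1, 2, 3, 4, 4, None]
mask = placeholder_labels(word_ids)
print(mask)  # [-100, 0, 0, 0, 0, 0, -100, -100]
```

With a real fast tokenizer the list would come from tokenized_inputs.word_ids(batch_index=i), as in tokenize_and_align_labels. The mask should be kept out of the model inputs (or stored under a different key than labels) so that no loss is computed against the placeholder values.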