How to know if a subtoken is a word or part of a word?

For example, using BERT in a token classification task, I get something like this …

[('Darüber', 17), ('hinaus', 17), ('fanden', 17), ('die', 17), ('Er', 17), ('##mitt', -100), ('##ler', -100), ('eine', 17), ('Ver', 17), ('##legung', -100), ('##sli', -100), ('##ste', -100), (',', 17), ('die', 17), ('bestätigt', 17), (',', 17), ('dass', 17), ('Dem', 8), ('##jan', -100), ('##juk', -100), ('am', 17), ('27', 17), ('.', -100), ('März', 17), ('1943', 17), ('an', 17), ('die', 17), ('Dienst', 17), ('##stelle', -100), ('So', 0), ('##bi', -100), ('##bor', -100), ('ab', 17), ('##kom', -100), ('##mand', -100), ('##iert', -100), ('wurde', 17), ('.', -100)]

… in the format of (sub-token, label id).

Is there a way I can automatically know that “##mitt” and “##ler” are part of “Er” (thus making up the word “Ermittler”) that would work across all tokenizers (not just BERT)?

What do you mean by “automatically know”?

I guess you already know that ##xx tokens are continuation tokens.
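For WordPiece tokenizers like BERT’s, checking whether a given subtoken is a continuation is just a prefix test. A minimal sketch (the “##” convention is WordPiece-specific; SentencePiece-based tokenizers, for example, mark word *starts* with a “▁” prefix instead, so this does not generalize):

```python
def is_continuation(subtoken: str) -> bool:
    # WordPiece (BERT) marks continuation pieces with a "##" prefix.
    # This is tokenizer-specific and does not hold for e.g. SentencePiece.
    return subtoken.startswith("##")

print(is_continuation("Er"))      # False
print(is_continuation("##mitt"))  # True
```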

I don’t think it is possible to detect from (‘Er’, 17) that it has a continuation.

If you feed the data into an untrained Bert model, [I think] the embedding layer will create an embedding vector for (‘Er’,17) that does not depend on the continuation tokens.
If you feed the data into a trained Bert model, the embedding layer might create different embedding vectors for different instances of (‘Er’, 17), depending on their context, which includes depending on any continuation tokens.

There is a nice tutorial by Chris McCormick here that discusses embeddings in more detail.

If you were asking about something else, please clarify.

[I am not an expert, and I could be wrong]

I mean programmatically be able to determine if a subtoken is “part” of a word or a word itself, regardless of the tokenizer. So given this:

... ('die', 17), ('Er', 17), ('##mitt', -100), ('##ler', -100) ...

A function that uses either the subtoken or the id and returns True/False depending on whether it is part of a word or a word itself …

tokenizer.is_token_part('die') #    => False
tokenizer.is_token_part('Er') #     => False
tokenizer.is_token_part('##mitt') # => True
tokenizer.is_token_part('##ler') #  => True

And again, this would work across all tokenizers.

I think that is impossible. For example, ‘Er’ will sometimes be a whole word and sometimes be part of ‘Ermittler’. If your function only sees ‘Er’ or ‘17’, it can’t know which is the case.

On the other hand, if your function sees the original text as well, it can easily detect whether it is a whole word or not. Different tokenizers will split words in different ways, but any function that sees both the whole text and the tokenized text could determine it.

Why do you need such a function?

For reassembling token classification predictions: given the 4 subtokens above, there should only be two predictions (one for “die” and one for “Ermittler”).

Oh, I see, thanks.

I think you will have to look at the next token (##mitt) to know the answer for Er.
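As a sketch of that look-ahead idea (again assuming the WordPiece “##” convention, so this is BERT-specific, not tokenizer-agnostic): merge each continuation subtoken into the word before it and keep only the first subtoken’s label for the whole word.

```python
def merge_wordpieces(pairs):
    """Group (subtoken, label) pairs into (word, label) pairs by
    merging "##" continuation tokens into the preceding word
    (WordPiece-specific convention)."""
    words = []
    for subtoken, label in pairs:
        if subtoken.startswith("##") and words:
            # Continuation: glue onto the previous word, keep its label.
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + subtoken[2:], prev_label)
        else:
            words.append((subtoken, label))
    return words

pairs = [('die', 17), ('Er', 17), ('##mitt', -100), ('##ler', -100)]
print(merge_wordpieces(pairs))
# [('die', 17), ('Ermittler', 17)]
```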

This doesn’t entirely answer your question, but one arg that might be helpful is return_offsets_mapping, which you can pass to the tokenizer and which will return the character offsets into the original sequence for each token. What you could do is some kind of simple (non-subword) tokenization first, and then pass the resulting tokens to the subword tokenizer with return_offsets_mapping=True and is_pretokenized=True. E.g.,

import numpy as np

encodings = tokenizer(['die', 'Ermittler'], is_pretokenized=True, return_offsets_mapping=True)
# ['[CLS]', 'die', 'er', '##mit', '##tler', '[SEP]']
# [(0, 0), (0, 3), (0, 2), (2, 5), (5, 9), (0, 0)]
is_subword = np.array(encodings['offset_mapping'])[:, 0] != 0
# array([False, False, False,  True,  True, False])

Note that you need to be using the fast tokenizers to use this feature, e.g. BertTokenizerFast.
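To show how that mask can be used downstream, here’s a sketch with the offsets hardcoded from the example above (so no tokenizer is needed to run it; the labels list is likewise just illustrative): keep one prediction per word by selecting positions whose offset starts at 0, while also excluding special tokens, which have a (0, 0) offset.

```python
import numpy as np

# Offsets hardcoded from the example above:
# [CLS], die, er, ##mit, ##tler, [SEP]
offset_mapping = [(0, 0), (0, 3), (0, 2), (2, 5), (5, 9), (0, 0)]
labels = [-100, 17, 17, -100, -100, -100]  # assumed per-token label ids, for illustration

offsets = np.array(offset_mapping)
is_subword = offsets[:, 0] != 0
# Special tokens like [CLS]/[SEP] also have offset (0, 0); exclude them too.
is_special = (offsets[:, 0] == 0) & (offsets[:, 1] == 0)
word_starts = ~is_subword & ~is_special

word_labels = [lab for lab, start in zip(labels, word_starts) if start]
print(word_labels)  # [17, 17] — one label per word: 'die' and 'Ermittler'
```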

There’s an example of this in the NER example of our custom datasets tutorial, in the encode_tags function definition (and in the paragraph before it).

Not a complete solution but hope that helps.


Ah thanks … that is exactly what I was looking for.

Is this going to make it into the standard tokenizers as well? If not, 1) why, and 2) when should folks choose one over the other (e.g., the “fast” vs. the standard tokenizers)?

Thanks again!

I think the short answer is that the fast (Rust-based) tokenizers are newer and will at some point completely replace the Python-based tokenizers. They are much faster, and I’m not sure there’s a good reason not to use them at this point, but I’m not the expert there. cc @mfuntowicz
