How to know if a subtoken is a word or part of a word?

For example, using BERT in a token classification task, I get something like this …

[('Darüber', 17), ('hinaus', 17), ('fanden', 17), ('die', 17), ('Er', 17), ('##mitt', -100), ('##ler', -100), ('eine', 17), ('Ver', 17), ('##legung', -100), ('##sli', -100), ('##ste', -100), (',', 17), ('die', 17), ('bestätigt', 17), (',', 17), ('dass', 17), ('Dem', 8), ('##jan', -100), ('##juk', -100), ('am', 17), ('27', 17), ('.', -100), ('März', 17), ('1943', 17), ('an', 17), ('die', 17), ('Dienst', 17), ('##stelle', -100), ('So', 0), ('##bi', -100), ('##bor', -100), ('ab', 17), ('##kom', -100), ('##mand', -100), ('##iert', -100), ('wurde', 17), ('.', -100)]

ā€¦ in the format of (sub-token, label id).

Is there a way to automatically know that "##mitt" and "##ler" are part of "Er" (thus making up the word "Ermittler"), one that would work across all tokenizers (not just BERT)?

What do you mean by "automatically know"?

I guess you already know that ##xx tokens are continuation tokens.

I don't think it is possible to detect from ('Er', 17) that it has a continuation.

If you feed the data into an untrained BERT model, [I think] the embedding layer will create an embedding vector for ('Er', 17) that does not depend on the continuation tokens.
If you feed the data into a trained BERT model, the embedding layer might create different embedding vectors for different instances of ('Er', 17), depending on their context, which includes any continuation tokens.

There is a nice tutorial by Chris McCormick here https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/ that discusses embeddings in more detail.

If you were asking about something else, please clarify.

[I am not an expert, and I could be wrong]

I mean being able to programmatically determine whether a subtoken is "part" of a word or a word itself, regardless of the tokenizer. So given this:

... ('die', 17), ('Er', 17), ('##mitt', -100), ('##ler', -100) ...

A function that uses either the subtoken or the id to return True/False depending on whether it is part of a word or a word itself …

tokenizer.is_token_part('die') #    => False
tokenizer.is_token_part('Er') #     => False
tokenizer.is_token_part('##mitt') # => True
tokenizer.is_token_part('##ler') #  => True

And again, this would work across all tokenizers.


I think that is impossible. For example, 'Er' will sometimes be a whole word and sometimes be part of 'Ermittler'. If your function only sees 'Er' or '17', it can't know which is the case.

On the other hand, if your function sees the original text as well, it can easily detect whether it is a whole word or not. Different tokenizers will split words in different ways, but any function that sees both the original text and the tokenized text could work it out.
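A rough sketch of that idea, purely as an illustration (it assumes WordPiece-style '##' continuation markers and simple whitespace-separated words; punctuation, casing and unknown tokens would need extra care in practice):

# Align WordPiece-style tokens against the whitespace words of the original
# text and flag which tokens start a new word. Illustration only.
def starts_new_word(original_text, tokens):
    words = original_text.split()
    flags, word_idx, consumed = [], 0, 0
    for tok in tokens:
        piece = tok[2:] if tok.startswith('##') else tok
        flags.append(consumed == 0)                 # True -> begins a new word
        consumed += len(piece)
        if word_idx < len(words) and consumed >= len(words[word_idx]):
            word_idx, consumed = word_idx + 1, 0    # advance to the next word
    return flags

print(starts_new_word('die Ermittler', ['die', 'Er', '##mitt', '##ler']))
# [True, True, False, False]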

Why do you need such a function?

For reassembling token classification predictions, where given the 4 subtokens above there would only be two predictions (one for "die" and one for "Ermittler").


Oh, I see, thanks.

I think you will have to look at the next token (##mitt) to know the answer for Er.
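For a WordPiece tokenizer like BERT's specifically (the '##' convention is an assumption here — it does not carry over to other tokenizers such as SentencePiece-based ones), that check could be as simple as:

def has_continuation(tokens, i):
    # True if tokens[i] is followed by a '##' continuation piece (WordPiece only)
    return i + 1 < len(tokens) and tokens[i + 1].startswith('##')

tokens = ['die', 'Er', '##mitt', '##ler']
print([has_continuation(tokens, i) for i in range(len(tokens))])
# [False, True, True, False]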

This doesn't entirely answer your question, but one arg that might be helpful is the return_offsets_mapping arg which you can pass to the tokenizer and which will return the character offsets into the original sequence for each token. What you could do is just do some kind of simple (non-subword) tokenization first, and then pass the resulting tokens to the subword tokenizer with return_offsets_mapping=True and is_pretokenized=True. E.g.,

import numpy as np
from transformers import BertTokenizerFast

# a fast tokenizer is required for offset mappings, e.g.:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

encodings = tokenizer(['die', 'Ermittler'], is_pretokenized=True, return_offsets_mapping=True)
print(tokenizer.convert_ids_to_tokens(encodings['input_ids']))
# ['[CLS]', 'die', 'er', '##mit', '##tler', '[SEP]']
print(encodings['offset_mapping'])
# [(0, 0), (0, 3), (0, 2), (2, 5), (5, 9), (0, 0)]
is_subword = np.array(encodings['offset_mapping'])[:, 0] != 0
print(is_subword)
# array([False, False, False,  True,  True, False])
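With that mask you could then collapse per-token output back to per-word output, which is what the original question was after. A rough sketch (the predictions array below is hypothetical — one label id per token, aligned with the six tokens above):

offsets = np.array(encodings['offset_mapping'])
is_special = (offsets[:, 0] == 0) & (offsets[:, 1] == 0)  # [CLS], [SEP], padding
is_word_start = ~is_subword & ~is_special                 # first sub-token of each word

predictions = np.array([-100, 17, 17, -100, -100, -100])  # hypothetical per-token label ids
print(predictions[is_word_start])
# [17 17]  -> one prediction each for 'die' and 'Ermittler'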

Note that you need to be using the fast tokenizers to use this feature, e.g. BertTokenizerFast.

There's an example of this in the NER example of our custom datasets tutorial, in (and in the paragraph just before) the encode_tags function definition.

Not a complete solution but hope that helps.


Ah thanks … that is exactly what I was looking for.

Is this going to make it into the standard tokenizers as well? If not, 1) why not, and 2) when should folks choose one over the other (e.g., the "fast" vs. the standard tokenizers)?

Thanks again!

I think the short answer is that the fast (Rust-based) tokenizers are newer and will at some point completely replace the Python-based tokenizers. They are much faster and I'm not sure if there's a good reason not to use them at this point, but I'm not the expert there. cc @mfuntowicz


There is a word_ids method on the returned BatchEncoding that does exactly this.

A list indicating the word corresponding to each token.

But it looks like it is only available with PreTrainedTokenizerFast. I didn't find word_ids on the value returned when using PreTrainedTokenizer.
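For instance (a minimal sketch with a fast tokenizer; the checkpoint and the exact subword splits are just for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
encoding = tokenizer('die Ermittler')
print(encoding.tokens())
# e.g. ['[CLS]', 'die', 'er', '##mit', '##tler', '[SEP]']
print(encoding.word_ids())
# e.g. [None, 0, 1, 1, 1, None]  -> None for special tokens, same index for pieces of one word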

The documentation says that this kind of mapping back to the original text is only available in the fast version:

When the tokenizer is a ā€œFastā€ tokenizer (i.e., backed by HuggingFace tokenizers library), this class provides in addition several advanced alignment methods which can be used to map between the original string (character and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token).
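Using the same encoding as in the sketch above, a couple of those alignment helpers look like this (again, fast tokenizers only; the exact indices in the comments depend on the checkpoint):

print(encoding.token_to_chars(3))  # character span covered by token 3, e.g. CharSpan(start=6, end=9)
print(encoding.char_to_token(5))   # index of the token covering character 5 of the input, e.g. 2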


The embedded link is no longer accessible. Can you please share a new working link for the custom datasets tutorial?