How to know if a subtoken is a word or part of a word?

For example, using BERT in a token classification task, I get something like this …

[('Darüber', 17), ('hinaus', 17), ('fanden', 17), ('die', 17), ('Er', 17), ('##mitt', -100), ('##ler', -100), ('eine', 17), ('Ver', 17), ('##legung', -100), ('##sli', -100), ('##ste', -100), (',', 17), ('die', 17), ('bestätigt', 17), (',', 17), ('dass', 17), ('Dem', 8), ('##jan', -100), ('##juk', -100), ('am', 17), ('27', 17), ('.', -100), ('März', 17), ('1943', 17), ('an', 17), ('die', 17), ('Dienst', 17), ('##stelle', -100), ('So', 0), ('##bi', -100), ('##bor', -100), ('ab', 17), ('##kom', -100), ('##mand', -100), ('##iert', -100), ('wurde', 17), ('.', -100)]

ā€¦ in the format of (sub-token, label id).

Is there a way to automatically know that "##mitt" and "##ler" are part of "Er" (thus making up the word "Ermittler"), one that would work across all tokenizers (not just BERT)?

What do you mean by "automatically know"?

I guess you already know that ##xx tokens are continuation tokens.

I don't think it is possible to detect from ('Er', 17) that it has a continuation.

If you feed the data into an untrained BERT model, [I think] the embedding layer will create an embedding vector for ('Er', 17) that does not depend on the continuation tokens.
If you feed the data into a trained BERT model, the embedding layer might create different embedding vectors for different instances of ('Er', 17), depending on their context, which includes any continuation tokens.

There is a nice tutorial by Chris McCormick here https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/ that discusses embeddings in more detail.

If you were asking about something else, please clarify.

[I am not an expert, and I could be wrong]

I mean being able to programmatically determine whether a subtoken is "part" of a word or a word itself, regardless of the tokenizer. So given this:

... ('die', 17), ('Er', 17), ('##mitt', -100), ('##ler', -100) ...

A function that uses either the subtoken or the id to return True/False depending on whether it is part of a word or a word itself …

tokenizer.is_token_part('die') #    => False
tokenizer.is_token_part('Er') #     => False
tokenizer.is_token_part('##mitt') # => True
tokenizer.is_token_part('##ler') #  => True

And again, this would work across all tokenizers.


I think that is impossible. For example, 'Er' will sometimes be a whole word and sometimes be part of 'Ermittler'. If your function only sees 'Er' or '17', it can't know which is the case.

On the other hand, if your function sees the original text as well, it can easily detect whether it is a whole word or not. Different tokenizers will split words in different ways, but any function that sees both the original text and the tokenized text could work it out.
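A rough sketch of that idea, purely as an illustration (it assumes WordPiece-style '##' continuation markers and simple whitespace-separated words; punctuation, casing and unknown tokens would need extra care in practice):

# Align WordPiece-style tokens against the whitespace words of the original
# text and flag which tokens start a new word. Illustration only.
def starts_new_word(original_text, tokens):
    words = original_text.split()
    flags, word_idx, consumed = [], 0, 0
    for tok in tokens:
        piece = tok[2:] if tok.startswith('##') else tok
        flags.append(consumed == 0)                 # True -> begins a new word
        consumed += len(piece)
        if word_idx < len(words) and consumed >= len(words[word_idx]):
            word_idx, consumed = word_idx + 1, 0    # advance to the next word
    return flags

print(starts_new_word('die Ermittler', ['die', 'Er', '##mitt', '##ler']))
# [True, True, False, False]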

Why do you need such a function?

For reassembling token classification predictions, where given the 4 subtokens above there would only be two predictions (one for "die" and one for "Ermittler").


Oh, I see, thanks.

I think you will have to look at the next token (##mitt) to know the answer for Er.
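For a WordPiece tokenizer like BERT's specifically (the '##' convention is an assumption here — it does not carry over to other tokenizers such as SentencePiece-based ones), that check could be as simple as:

def has_continuation(tokens, i):
    # True if tokens[i] is followed by a '##' continuation piece (WordPiece only)
    return i + 1 < len(tokens) and tokens[i + 1].startswith('##')

tokens = ['die', 'Er', '##mitt', '##ler']
print([has_continuation(tokens, i) for i in range(len(tokens))])
# [False, True, True, False]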

This doesn't entirely answer your question, but one arg that might be helpful is the return_offsets_mapping arg which you can pass to the tokenizer and which will return the character offsets into the original sequence for each token. What you could do is just do some kind of simple (non-subword) tokenization first, and then pass the resulting tokens to the subword tokenizer with return_offsets_mapping=True and is_pretokenized=True. E.g.,

import numpy as np
from transformers import BertTokenizerFast

# a fast tokenizer is required for offset mappings, e.g.:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

encodings = tokenizer(['die', 'Ermittler'], is_pretokenized=True, return_offsets_mapping=True)
print(tokenizer.convert_ids_to_tokens(encodings['input_ids']))
# ['[CLS]', 'die', 'er', '##mit', '##tler', '[SEP]']
print(encodings['offset_mapping'])
# [(0, 0), (0, 3), (0, 2), (2, 5), (5, 9), (0, 0)]
is_subword = np.array(encodings['offset_mapping'])[:, 0] != 0
print(is_subword)
# array([False, False, False,  True,  True, False])
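With that mask you could then collapse per-token output back to per-word output, which is what the original question was after. A rough sketch (the predictions array below is hypothetical — one label id per token, aligned with the six tokens above):

offsets = np.array(encodings['offset_mapping'])
is_special = (offsets[:, 0] == 0) & (offsets[:, 1] == 0)  # [CLS], [SEP], padding
is_word_start = ~is_subword & ~is_special                 # first sub-token of each word

predictions = np.array([-100, 17, 17, -100, -100, -100])  # hypothetical per-token label ids
print(predictions[is_word_start])
# [17 17]  -> one prediction each for 'die' and 'Ermittler'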

Note that you need to be using the fast tokenizers to use this feature, e.g. BertTokenizerFast.

There's an example of this in the NER example of our custom datasets tutorial, in (and in the paragraph just before) the encode_tags function definition.

Not a complete solution but hope that helps.


Ah thanks … that is exactly what I was looking for.

Is this going to make it into the standard tokenizers as well? If not, 1) why not, and 2) when should folks choose one over the other (e.g., the "fast" vs. the standard tokenizers)?

Thanks again!

I think the short answer is that the fast (Rust-based) tokenizers are newer and will at some point completely replace the Python-based tokenizers. They are much faster and I'm not sure if there's a good reason not to use them at this point, but I'm not the expert there. cc @mfuntowicz


There is a word_ids method on the returned BatchEncoding that does exactly this.

A list indicating the word corresponding to each token.

But it looks like it is only available with PreTrainedTokenizerFast. I didn't find word_ids on the value returned when using PreTrainedTokenizer.
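For instance (a minimal sketch with a fast tokenizer; the checkpoint and the exact subword splits are just for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
encoding = tokenizer('die Ermittler')
print(encoding.tokens())
# e.g. ['[CLS]', 'die', 'er', '##mit', '##tler', '[SEP]']
print(encoding.word_ids())
# e.g. [None, 0, 1, 1, 1, None]  -> None for special tokens, same index for pieces of one word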

The documentation says that this kind of mapping back to the original text is only available in the fast version:

When the tokenizer is a ā€œFastā€ tokenizer (i.e., backed by HuggingFace tokenizers library), this class provides in addition several advanced alignment methods which can be used to map between the original string (character and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token).
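Using the same encoding as in the sketch above, a couple of those alignment helpers look like this (again, fast tokenizers only; the exact indices in the comments depend on the checkpoint):

print(encoding.token_to_chars(3))  # character span covered by token 3, e.g. CharSpan(start=6, end=9)
print(encoding.char_to_token(5))   # index of the token covering character 5 of the input, e.g. 2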


The embedded link is no longer accessible. Can you please share a new working link for the custom datasets tutorial?