How to reconstruct a sentence after it is encoded using BPE?

I think the issue has to do with the encoding part. As you mention, it is concerning that it assigns an individual token to every single letter, but it’s also in that step that you’re “losing” the spaces.

If you look at your sequence of tokens with print(encoding.ids), you’ll get [33, 80, 62, 74, 77, 73, 66, 80, 66, 75, 81, 66, 75, 64, 66].

Then, decoding them one by one, you will see that, for example, 33 is “A” and 80 is “s”, which means that at this point the spaces are already gone.
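You can inspect this yourself with a quick loop. A small sketch, assuming tokenizer is the tokenizers.Tokenizer instance that produced encoding:

for token_id in encoding.ids:
    # decode each id on its own; repr() makes any whitespace visible
    print(token_id, repr(tokenizer.decode([token_id])))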

You can try any model from the Hub for comparison:

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("roberta-base")
t.encode("This is a test")

This returns [0, 713, 16, 10, 1296, 2]. Ignoring the initial and final IDs, which correspond to the BOS and EOS tokens, you can see that the other ones preserve the spaces.
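You can verify this token by token with a small loop (t being the roberta-base tokenizer from the snippet above):

for token_id in t.encode("This is a test"):
    # repr() shows the leading space in tokens like " is"
    print(token_id, repr(t.decode([token_id])))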

For instance, decoding the 16 (t.decode([16])) you can see that it maps to “ is”, which is different from “is” without the whitespace (that would be token id 354 for roberta-base). This is what the Ġ character is used for in BPE tokenizers: it marks that a token starts a new word, i.e. that it is preceded by a space.
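You can also see the Ġ directly with convert_ids_to_tokens, which returns the raw tokens without decoding them:

t.convert_ids_to_tokens([713, 16, 10, 1296])
# ['This', 'Ġis', 'Ġa', 'Ġtest']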

So I don’t know what you are trying to do with the encoder, but I’d say that the problem is there rather than with the decoder :man_shrugging:
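In case it helps: if you’re building your tokenizer with the Hugging Face tokenizers library, the usual way to keep the spaces around is to pair a byte-level pre-tokenizer with a byte-level decoder, which is the same setup roberta-base uses. A minimal sketch (the training corpus and the vocab_size are made up for illustration):

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# the byte-level pre-tokenizer folds each space into the token that follows it (the Ġ)
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
# the matching decoder turns the Ġ back into a space when decoding
tokenizer.decoder = decoders.ByteLevel()

trainer = BpeTrainer(
    special_tokens=["[UNK]"],
    vocab_size=1000,  # made-up size for the sketch
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # cover every possible byte
)
tokenizer.train_from_iterator(["A sample sentence", "another line of text"], trainer=trainer)

encoding = tokenizer.encode("A sample sentence")
print(tokenizer.decode(encoding.ids))  # the spaces survive the round trip

With that in place, each id maps to a token that already carries its space information, so decoding gives you back the original sentence.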