How to reconstruct a sentence after it is encoded using BPE?

I think the issue has to do with the encoding part. As you mention, it is concerning that it assigns an individual token to every single letter, but it’s also in that step that you’re “losing” the spaces.

If you look at your sequence of tokens with print(encoding.ids), you’ll get [33, 80, 62, 74, 77, 73, 66, 80, 66, 75, 81, 66, 75, 64, 66].

Then, decoding them one by one, you will see that, for example, 33 is “A” and 80 is “s”, which means that at this point the spaces are already gone.
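You can inspect this yourself with a quick loop. A small sketch, assuming tokenizer is the tokenizers.Tokenizer instance that produced encoding:

for token_id in encoding.ids:
    # decode each id on its own; repr() makes any whitespace visible
    print(token_id, repr(tokenizer.decode([token_id])))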

You can try any model from the Hub for comparison:

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("roberta-base")
t.encode("This is a test")

This returns [0, 713, 16, 10, 1296, 2]. Ignoring the initial and final IDs, which correspond to the BOS and EOS tokens, you can see that the other ones preserve the spaces.
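You can verify this token by token with a small loop (t being the roberta-base tokenizer from the snippet above):

for token_id in t.encode("This is a test"):
    # repr() shows the leading space in tokens like " is"
    print(token_id, repr(t.decode([token_id])))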

For instance, decoding the 16 (t.decode([16])) you can see that it maps to “ is”, which is different from “is” without the whitespace (that would be token id 354 for roberta-base). This is what the Ġ character is used for in BPE tokenizers: it marks that a token starts a new word, i.e. that it is preceded by a space.
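You can also see the Ġ directly with convert_ids_to_tokens, which returns the raw tokens without decoding them:

t.convert_ids_to_tokens([713, 16, 10, 1296])
# ['This', 'Ġis', 'Ġa', 'Ġtest']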

So I don’t know what you are trying to do with the encoder, but I’d say that the problem is there rather than with the decoder :man_shrugging:
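In case it helps: if you’re building your tokenizer with the Hugging Face tokenizers library, the usual way to keep the spaces around is to pair a byte-level pre-tokenizer with a byte-level decoder, which is the same setup roberta-base uses. A minimal sketch (the training corpus and the vocab_size are made up for illustration):

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# the byte-level pre-tokenizer folds each space into the token that follows it (the Ġ)
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
# the matching decoder turns the Ġ back into a space when decoding
tokenizer.decoder = decoders.ByteLevel()

trainer = BpeTrainer(
    special_tokens=["[UNK]"],
    vocab_size=1000,  # made-up size for the sketch
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # cover every possible byte
)
tokenizer.train_from_iterator(["A sample sentence", "another line of text"], trainer=trainer)

encoding = tokenizer.encode("A sample sentence")
print(tokenizer.decode(encoding.ids))  # the spaces survive the round trip

With that in place, each id maps to a token that already carries its space information, so decoding gives you back the original sentence.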