How to reconstruct a sentence after it is encoded using BPE?

After reading the tutorials as well as the documentation, I thought I knew how to train, encode, and decode a sentence using BPE. But when I test it, the tokens of the sample sentence are all put together without any space.

Here’s the code that I have:

from datasets import load_dataset
from tokenizers import (
    Tokenizer,
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    trainers,
)

def batch_iterator(dataset, batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(batch_iterator(dataset), trainer=trainer, length=len(dataset))

tokenizer.decoder = decoders.BPEDecoder()

encoding = tokenizer.encode("A sample sentence")
print(encoding.tokens)

This code prints:

['A', 's', 'a', 'm', 'p', 'l', 'e', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']

At this point, I don’t care about the printed tokens (even though the individual characters are a bit concerning). My question is: why is the decoded version missing spaces? Am I using the wrong decoder? If so, what is the right decoder for BPE?

I think the issue has to do with the encoding part… as you mention, it is concerning that it assigns an individual token to every single letter, but it is also in that step that you are “losing” the spaces.

If you look at your sequence of token ids with print(encoding.ids), you’ll get [33, 80, 62, 74, 77, 73, 66, 80, 66, 75, 81, 66, 75, 64, 66].

Then, decoding them one by one, you will see that, for example, 33 is the “A” and 80 the “s”, which means that the spaces are already gone at this point.
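You can reproduce the symptom in a few lines without the wikitext download. This is a minimal sketch (tiny toy corpus instead of the original dataset) using the same Whitespace pre-tokenizer and BPEDecoder as the code above:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Whitespace pre-tokenization discards the spaces before training/encoding
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.BPEDecoder()

trainer = trainers.BpeTrainer(vocab_size=30, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["A sample sentence"], trainer=trainer)

encoding = tokenizer.encode("A sample sentence")
for token_id, token in zip(encoding.ids, encoding.tokens):
    print(token_id, token)  # no token carries a space

# None of the tokens contains a space, so joining them back
# cannot restore the original spacing:
print(tokenizer.decode(encoding.ids))  # prints "Asamplesentence"
```

The decode step just concatenates whatever the tokens contain; since the Whitespace pre-tokenizer already removed the spaces, there is nothing left for any decoder to restore.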

You can try with any model from the hub for comparison:

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("roberta-base")
t.encode("This is a test")

This returns [0, 713, 16, 10, 1296, 2]. Ignoring the initial and final ids which correspond to the BOS and EOS tokens, you can see that the other ones preserve the spaces.

For instance, decoding the 16 (t.decode([16])) you can see that it maps to " is", which is different from “is” without the whitespace (that would be token_id 354 for roberta-base). This is what the Ġ character is used for in BPE tokenizers: to indicate that a token begins a new word.
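You can see the Ġ marker without downloading any model by running the byte-level pre-tokenizer from the tokenizers library directly (a small illustrative sketch; ByteLevel is what GPT-2/RoBERTa-style BPE models use):

```python
from tokenizers import pre_tokenizers

# ByteLevel maps a leading space onto the following token as the
# Ġ character instead of discarding it.
pt = pre_tokenizers.ByteLevel(add_prefix_space=False)
pieces = [piece for piece, _span in pt.pre_tokenize_str("This is a test")]
print(pieces)  # ['This', 'Ġis', 'Ġa', 'Ġtest']
```

Every token except the first carries its preceding space as Ġ, which is exactly the information the decoder later uses to put the spaces back.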

So I don’t know what you are trying to do with the encoder, but I’d say that the problem is there rather than with the decoder :man_shrugging:

Thanks, @mapama247, for the info. I think I found the issue myself, even though I’m not 100% sure.

The problem in my code has to do with my choice of pre_tokenizer. Apparently, in BPE, the space is treated as a character, and you should not eliminate it from your input text, so that it can be part of your dictionary. Right now, I’m using this pre-tokenizer:

tokenizer.pre_tokenizer = pre_tokenizers.Split("\n", "removed")

And it works: now I’m only excluding \n from my dictionary. But again, I’m not 100% confident that this is the right way. I’ll do more research to make sure.
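For reference, another common way to keep spaces recoverable (the approach GPT-2-style tokenizers use, not necessarily what you need here) is to pair the ByteLevel pre-tokenizer with its matching ByteLevel decoder. A minimal sketch on a toy corpus:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# ByteLevel encodes each space as the Ġ byte, so spacing survives encoding...
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
# ...and the matching decoder turns the Ġ bytes back into real spaces.
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=300,
    special_tokens=["[UNK]"],
    # seed the vocabulary with the full 256-byte alphabet
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(["A sample sentence"], trainer=trainer)

encoding = tokenizer.encode("A sample sentence")
print(tokenizer.decode(encoding.ids))  # prints "A sample sentence"
```

Because pre-tokenizer and decoder are inverses of each other, encode followed by decode round-trips the input, spaces included.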