After reading the tutorials as well as the documentation, I thought I knew how to train, encode, and decode a sentence using BPE. But when I test it, the decoded sentence comes back with all the tokens run together, without any spaces.
Here’s the code that I have:
import sys
from datasets import load_dataset
from tokenizers import (
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
    decoders,
)


def batch_iterator(dataset, batch_size=1000):
    # Yield the raw "text" column in batches for training
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]


dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")

# BPE model with NFKC normalization and whitespace pre-tokenization
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(batch_iterator(dataset), trainer=trainer, length=len(dataset))

# Decoder used to turn the token ids back into a string
tokenizer.decoder = decoders.BPEDecoder()

encoding = tokenizer.encode("A sample sentence")
print(encoding.tokens)
print(tokenizer.decode(encoding.ids))
This code prints:
['A', 's', 'a', 'm', 'p', 'l', 'e', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']
Asamplesentence
At this point, I don’t care about the printed tokens (even though the individual characters are a bit concerning). My question is: why is the decoded version missing spaces? Am I using the wrong decoder? If so, what is the right decoder for BPE?
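In case it helps narrow things down, here is a stripped-down sketch of what I believe is the same setup, without the dataset download and trained on a few hard-coded sentences (the sentences and vocab size are just placeholders I made up), which as far as I can tell shows the same behavior:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders

# Same BPE / Whitespace / BPEDecoder combination, trained on in-memory text
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
tok.decoder = decoders.BPEDecoder()

trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
tok.train_from_iterator(["A sample sentence", "Another sample sentence"] * 50, trainer=trainer)

enc = tok.encode("A sample sentence")
print(enc.tokens)
print(tok.decode(enc.ids))  # the words still come out joined together, with no spaces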