Add BOS and EOS when encoding a sentence

I’m training a tokenizer from scratch. I’m using BPE tokenizer with ByteLevel pre-tokenizer.

How do I add [BOS] at the beginning of each sentence and [EOS] at the end? I can do it manually, of course, but is there a way to tell the encoder to do it automatically? This tokenizer with this pre-tokenizer does in fact already add the same token at the end of each sentence (token “Ċ” with token_id=163). I would prefer to have control over what the end-of-sentence token looks like and what its id is. How do I do that?

Hi @alexgrishin, is the following code what you’re after?

# I use tutorial code from https://huggingface.co/docs/tokenizers/quicktour as example
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "[BOS]", "[EOS]"]) #  Adding [BOS] and [EOS] here
tokenizer.pre_tokenizer = Whitespace()

# Use a TemplateProcessing post-processor to add the tokens automatically
# https://huggingface.co/docs/tokenizers/api/post-processors#tokenizers.processors.TemplateProcessing
from tokenizers.processors import TemplateProcessing

files = [f"datasets/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)

# Set the post-processor after training so we can look up the ids the trainer
# actually assigned to [BOS] and [EOS] instead of hard-coding them
# (with the special_tokens list above, [BOS]=5 and [EOS]=6; hard-coding 1 and 2
# would insert the ids of [CLS] and [SEP] instead)
tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $A [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.ids)
# >> the sequence now starts with the [BOS] id and ends with the [EOS] id
print(tokenizer.decode(output.ids))
# >> decode() skips special tokens by default, so no [BOS]/[EOS] appear here
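In case it helps, TemplateProcessing also takes a `pair` template for sentence pairs, and `decode(..., skip_special_tokens=False)` keeps the special tokens visible in the decoded string. A minimal self-contained sketch (it trains on a tiny in-memory corpus instead of the wikitext files, so the token ids will differ from a real run):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[BOS]", "[EOS]"])
# Tiny in-memory corpus so the sketch runs without any files on disk
tokenizer.train_from_iterator(["hello world", "how are you"], trainer)

tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $A [EOS]",
    # $B:1 gives the second sentence of a pair type id 1
    pair="[BOS] $A [EOS] $B:1 [EOS]:1",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)

enc = tokenizer.encode("hello world", "how are you")  # a sentence pair
print(enc.tokens)  # starts with [BOS], has [EOS] after each sentence
# decode() drops special tokens by default; keep them like this:
print(tokenizer.decode(enc.ids, skip_special_tokens=False))
```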

Thank you! That’s exactly what I was after!
