Add BOS and EOS when encoding a sentence

I’m training a tokenizer from scratch. I’m using BPE tokenizer with ByteLevel pre-tokenizer.

How do I add [BOS] at the beginning of each sentence and [EOS] at the end? I can do it manually, of course, but is there a way to tell the encoder to do it automatically? This tokenizer with this pre-tokenizer does in fact already add the same token at the end of each sentence (token “Ċ” with token_id=163). I would prefer to have control over what the end-of-sentence token looks like and what its id is. How do I do that?

Hi @alexgrishin, is the following code what you’re after?

# I use tutorial code from https://huggingface.co/docs/tokenizers/quicktour as example
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "[BOS]", "[EOS]"]) #  Adding [BOS] and [EOS] here
tokenizer.pre_tokenizer = Whitespace()

# Use a TemplateProcessing post-processor to add the tokens automatically
# https://huggingface.co/docs/tokenizers/api/post-processors#tokenizers.processors.TemplateProcessing
from tokenizers.processors import TemplateProcessing

files = [f"datasets/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)

# Set the post-processor after training so we can look up the ids the trainer
# actually assigned to [BOS] and [EOS] instead of hard-coding them
# (with the special_tokens list above, [BOS]=5 and [EOS]=6; hard-coding 1 and 2
# would insert the ids of [CLS] and [SEP] instead)
tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $A [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.ids)
# >> the sequence now starts with the [BOS] id and ends with the [EOS] id
print(tokenizer.decode(output.ids))
# >> decode() skips special tokens by default, so no [BOS]/[EOS] appear here
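In case it helps, TemplateProcessing also takes a `pair` template for sentence pairs, and `decode(..., skip_special_tokens=False)` keeps the special tokens visible in the decoded string. A minimal self-contained sketch (it trains on a tiny in-memory corpus instead of the wikitext files, so the token ids will differ from a real run):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[BOS]", "[EOS]"])
# Tiny in-memory corpus so the sketch runs without any files on disk
tokenizer.train_from_iterator(["hello world", "how are you"], trainer)

tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $A [EOS]",
    # $B:1 gives the second sentence of a pair type id 1
    pair="[BOS] $A [EOS] $B:1 [EOS]:1",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)

enc = tokenizer.encode("hello world", "how are you")  # a sentence pair
print(enc.tokens)  # starts with [BOS], has [EOS] after each sentence
# decode() drops special tokens by default; keep them like this:
print(tokenizer.decode(enc.ids, skip_special_tokens=False))
```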

Thank you! That’s exactly what I was after!
