Special tokens appear to be treated as atomic. However, the implementation of special tokens is fairly complex (it has been revised repeatedly over a long period of time), so it is safer to check the current behavior of the library while working with them.
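As a rough illustration of that atomicity, here is a minimal sketch assuming the Hugging Face `tokenizers` library (the `<sep>` token, the whitespace pre-tokenizer, and the tiny corpus are made up for the example): a token passed to the trainer as a special token is also registered on the tokenizer as an added special token, and at encode time it is matched before the BPE model runs, so it should not be split.

```
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy tokenizer trained on a made-up corpus, purely for illustration.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["<unk>", "<sep>"], vocab_size=100)
tokenizer.train_from_iterator(["ABCD EFGH", "ABAB EFEF"], trainer=trainer)

# '<sep>' is matched as an added special token before the BPE model,
# so it should come back as a single token rather than '<', 'sep', '>'.
print(tokenizer.encode("ABCD<sep>EFGH").tokens)
```

Since the behavior has changed across versions, a small check like this is an easy way to confirm what the installed version actually does.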
I’m building a BPE tokenizer from scratch, and I’d like to add some tokens to its vocabulary that are never broken apart and never included in merges.
The second part of that (no tokens in the merges) is easy: the tokens don’t appear in the training text I give the tokenizer. But for the first part (never breaking my special token, e.g. `[Y]`, into its `[`, `Y`, and `]` components), I’m less sure of the correct process.
I had thought I could just add the tokens to the tokenizer using add_tokens() befo…
(GitHub issue, opened 21 Apr 2022, closed 27 Apr 2022)
Hi,
I want to train a tokenizer with code like the following
```
from tokenizers.trainers import BpeTrainer

# tokenizer, vocab_size and iterator_over_seqs come from setup not shown here.
# I am not sure about the correct way, so I try to add '<sep>' in every possible way.
trainer = BpeTrainer(special_tokens=["<unk>", "<pad>", "<sep>"], vocab_size=vocab_size)
tokenizer.add_special_tokens(["<sep>"])
tokenizer.add_tokens(["<sep>"])
tokenizer.train_from_iterator(iterator_over_seqs, trainer=trainer)
```
An example sequence is `ABCD<sep>EFGH`.
However, the trained vocabulary contains the tokens `'<'`, `'>'`, `'e'`, `'ep'`, `'p'`, `'s'`, and `'sep'`, which are undesired.
So I'm wondering: what should I do to make the tokenizer treat `'<sep>'` as a single special token?
Thanks
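For the quoted issue, whatever the reason for the unwanted entries in the vocabulary, a quick sanity check is to ask the trained tokenizer directly whether `<sep>` got its own entry and whether it survives encoding in one piece. This is only a sketch; it assumes the `tokenizer` object from the snippet above after training has finished.

```
# Two quick checks on the trained tokenizer from the snippet above:
# 1) '<sep>' should have its own entry in the vocabulary (added tokens included),
# 2) it should be encoded as a single token, not as '<', 'sep', '>'.
print("<sep>" in tokenizer.get_vocab())
print(tokenizer.encode("ABCD<sep>EFGH").tokens)
```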