Adding atomic / indivisible tokens to BPE tokenizer

Hi all,

I am trying to create a BPE tokenizer that has added "atomic" tokens before training, meaning words of length > 1 that should always be tokenized as one indivisible unit (but possibly merged with other words).

For example, my input text can look like: "HKLDCPHY<RR><QQ>PDHGIVMN", where <RR> and <QQ> must be treated essentially as single characters. After tokenization they can be their own tokens or merged into something like PHY<RR> or <RR><QQ>, but they should never be split (something like ["PHY", "<RR", "><", "QQ", ">", "PDH", …] is bad, for example).

Is there a way to do this?

I would encode the important substrings as single characters, but I have a lot of them, and past the ASCII range I sometimes get "Â" appended to tokens containing the rest of the characters; plus, I really want the tokens to be interpretable.

Thanks in advance.


Special tokens seem to be treated as atomic. However, the implementation of special tokens is quite complex (it has been revised and changed over a long period of time), so it is safer to verify the current behavior as you work.
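For example, a minimal check (using a pretrained GPT-2 tokenizer purely for illustration; any fast tokenizer should behave the same way):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# Register the multi-character units as additional special tokens
tok.add_special_tokens({"additional_special_tokens": ["<RR>", "<QQ>"]})

print(tok.tokenize("HKLDCPHY<RR><QQ>PDHGIVMN"))
# '<RR>' and '<QQ>' come out as single, unsplit tokens; the surrounding text
# is tokenized as usual.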

If I add atomic tokens as special ones, will it stop them from being split?


Yes. What you want is a BPE tokenizer with added pre-tokenized atomic substrings that:

- are treated as indivisible units (never split at any point),
- can be merged with surrounding tokens (e.g. PHY<RR>),
- are multi-character, but always considered atomic,
- stay interpretable (not replaced with Unicode hacks or private-use glyphs).

Here is how to do this cleanly with modern tokenization tools, particularly Hugging Face tokenizers (Rust-backed and flexible). The idea: reserve a vocabulary of predefined atomic substrings, then learn merges on top of those atomic substrings.

Step-by-step implementation with tokenizers:

1. Define your atomic units.

atomic_units = ["<RR>", "<QQ>", "<PP>", "<XX>", "HKLDCPHY"]

These must never be split, even though they are multi-character.

2. Pre-tokenize your corpus with atomic units as special tokens.

Before training, wrap your corpus in a custom pre-tokenizer that:

- scans the input for atomic units,
- inserts them as single "tokens" before BPE starts learning.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders
from tokenizers.pre_tokenizers import PreTokenizer
import re

class AtomicUnitPreTokenizer:
    def __init__(self, atomic_units):
        self.atomic_units = sorted(atomic_units, key=len, reverse=True)
        self.pattern = re.compile("|".join(re.escape(a) for a in self.atomic_units))

    def pre_tokenize_str(self, input_str):
        def _split(match):
            # Return the atomic unit as its own token
            return f" {match.group(0)} "

        # Inject spaces around atomic units so tokenizers can treat them as isolated
        split_str = self.pattern.sub(_split, input_str)
        # Then split on whitespace
        return [(token, (0, 0)) for token in split_str.strip().split()]
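To sanity-check the pre-tokenizer on its own before wiring it in (a small usage sketch; note this simplified version returns dummy (0, 0) offsets for every piece):

units = ["<RR>", "<QQ>"]
print(AtomicUnitPreTokenizer(units).pre_tokenize_str("HKLDCPHY<RR><QQ>PDHGIVMN"))
# [('HKLDCPHY', (0, 0)), ('<RR>', (0, 0)), ('<QQ>', (0, 0)), ('PDHGIVMN', (0, 0))]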

Plug this into Hugging Face's tokenizer pipeline.

3. Initialize and train the tokenizer.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Your atomic units
atomic_units = ["<RR>", "<QQ>", "<PP>", "<XX>"]

# Tokenizer with custom pre-tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = AtomicUnitPreTokenizer(atomic_units)

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"] + atomic_units)

# Corpus containing your sequences
corpus = ["HKLDCPHY<RR><QQ>PDHGIVMN", "ABCDE<PP><XX>"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

4. Test the tokenization.
encoded = tokenizer.encode("HKLDCPHY<RR><QQ>PDHGIVMN")
print(encoded.tokens)
Good output:

['HKLDCPHY', '<RR>', '<QQ>', 'PDHGIVMN']

Or even: ['PHY', '<RR>', '<QQ>', 'PDH'], as long as <RR> and <QQ> aren't split.

Bad output:

['<', 'RR', '>', '<', 'QQ', '>']
That should successfully force atomic tokens into the model without hacking the Unicode space or losing interpretability. The key is treating <RR> and the others as "pre-tokenized" units before training. You don't need to substitute weird ASCII or Unicode characters; you're just teaching the tokenizer to respect them and treat them as atomic substrings.

This integrates well with Hugging Face transformers via PreTrainedTokenizerFast.

5. Add to special tokens.

If you want <RR>, <QQ>, etc. to be treated as special tokens later (e.g. for generation), you can:

tokenizer.add_special_tokens(["<RR>", "<QQ>", "<PP>", "<XX>"])

If you just want to convert to a transformers tokenizer:

from transformers import PreTrainedTokenizerFast
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
hf_tokenizer.add_special_tokens({"additional_special_tokens": atomic_units})

You don't need to encode them into Unicode glyphs or sacrifice interpretability. You just need:

- a custom pre-tokenizer that isolates the atomic units,
- a BPE tokenizer trained on top of those preserved tokens,
- optionally, special_tokens for further control.


Thanks @John6666,

Adding these as special tokens works to keep them from fragmenting, but I would actually like these to participate in merges, and special tokens don’t do that. Is there a way to achieve this outcome?
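For example (a quick illustration with a pretrained GPT-2 tokenizer, just to show the behavior I mean):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.add_special_tokens({"additional_special_tokens": ["<RR>"]})

print(tok.tokenize("PHY<RR>"))
# '<RR>' is split out before BPE runs, so it always stays its own token;
# a merged token like 'PHY<RR>' can never be produced this way.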


Thank you @NuralNexus!

I tried to recreate your code because it's a bit mangled, and this is what I have:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders
from tokenizers.pre_tokenizers import PreTokenizer
import re


class AtomicUnitPreTokenizer:
    def __init__(self, atomic_units):
        self.atomic_units = sorted(atomic_units, key=len, reverse=True)
        # Create regex pattern that matches any of the atomic units
        self.pattern = re.compile("|".join(re.escape(a) for a in self.atomic_units))

    def pre_tokenize_str(self, input_str):
        tokens = []
        last_end = 0

        # Find all atomic units in the string
        for match in self.pattern.finditer(input_str):
            start, end = match.span()

            # Add any text before the atomic unit
            if start > last_end:
                before_text = input_str[last_end:start]
                if before_text:
                    tokens.append((before_text, (last_end, start)))

            # Add the atomic unit as a single token
            tokens.append((match.group(0), (start, end)))
            last_end = end

        # Add any remaining text after the last atomic unit
        if last_end < len(input_str):
            remaining_text = input_str[last_end:]
            if remaining_text:
                tokens.append((remaining_text, (last_end, len(input_str))))

        return tokens


# Define your atomic units - these will never be split
atomic_units = ["<RR>", "<QQ>", "<PP>", "<XX>"]

# Initialize tokenizer with BPE model
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Set up the pre-tokenizer to handle atomic units
tokenizer.pre_tokenizer = AtomicUnitPreTokenizer(atomic_units)

# Set up the trainer
trainer = trainers.BpeTrainer(
    vocab_size=1000, special_tokens=["[UNK]"] + atomic_units, min_frequency=2
)

# Example corpus containing your sequences
corpus = [
    "HKLDCPHY<RR><QQ>PDHGIVMN",
    "ABCDE<PP><XX>FGHIJK",
    "MNOP<RR>QRST<QQ>UVWX",
    "YZAB<PP>CDEF<XX>GHIJ",
]

corpus += ["<QQ><RR>"] * 100

# Train the tokenizer
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Test the tokenization
test_string = "HKLDCPHY<RR><QQ>PDHGIVMN"
encoded = tokenizer.encode(test_string)
print(f"Input: {test_string}")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")

# Test with another example
test_string2 = "ABC<PP><XX>DEF<RR>GHI"
encoded2 = tokenizer.encode(test_string2)
print(f"\nInput: {test_string2}")
print(f"Tokens: {encoded2.tokens}")
print(f"IDs: {encoded2.ids}")

# Optional: Convert to HuggingFace format for use with transformers
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
hf_tokenizer.add_special_tokens({"additional_special_tokens": atomic_units})

print(f"\nHuggingFace tokenizer test:")
print(f"Tokens: {hf_tokenizer.tokenize(test_string)}")
print(f"IDs: {hf_tokenizer.encode(test_string)}")

First of all is this what you intended?

I had actually already tried something similar, but I ran into an error, which I'm hitting again with this code:

TypeError: argument 'pretok': 'AtomicUnitPreTokenizer' object cannot be converted to 'PreTokenizer'

I tried debugging it but haven't had much luck. Any idea how to progress?


Hmm… Like this?

#tokenizer.pre_tokenizer = AtomicUnitPreTokenizer(atomic_units)
tokenizer.pre_tokenizer = PreTokenizer.custom(AtomicUnitPreTokenizer(atomic_units))

Thanks @John6666, that did get me closer!

I tried:

from tokenizers import Tokenizer, models, trainers, PreTokenizedString, NormalizedString
from tokenizers.pre_tokenizers import PreTokenizer, BertPreTokenizer
import re


class AtomicUnitPreTokenizer:
    def __init__(self, atomic_units):
        self.atomic_units = sorted(atomic_units, key=len, reverse=True)
        # Create regex pattern that matches any of the atomic units
        self.pattern = re.compile("|".join(re.escape(a) for a in self.atomic_units))
        print(f"Atomic units regex pattern: {self.pattern.pattern}")

    def split(self, _, norm: NormalizedString):
        # split the input string into tokens wrapped in < xx > and everything else
        tokens = []
        start = 0
        input_str = str(norm)
        for match in self.pattern.finditer(input_str):
            # Add everything before the match as a token
            if start < match.start():
                tokens.append(input_str[start : match.start()])
            # Add the matched atomic unit as a token
            tokens.append(input_str[match.start() : match.end()])
            start = match.end()
        # Add any remaining part of the string as a token
        if start < len(input_str):
            tokens.append(input_str[start:])

        print(f"Split tokens: {tokens}")
        return [NormalizedString(token) for token in tokens]

    def pre_tokenize(self, pre_tokenized: PreTokenizedString):
        return pre_tokenized.split(self.split)


# Define your atomic units - these will never be split
atomic_units = ["<RR>", "<QQ>"]
# Initialize tokenizer with BPE model
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Set up the pre-tokenizer to handle atomic units
tokenizer.pre_tokenizer = PreTokenizer.custom(AtomicUnitPreTokenizer(atomic_units))

# Set up the trainer
trainer = trainers.BpeTrainer(
    vocab_size=1000, special_tokens=["[UNK]"], min_frequency=1
)

# Example corpus containing your sequences
corpus = ["<RR><QQ>" for _ in range(3)]

# add the atomic units as regular tokens
for unit in atomic_units:
    tokenizer.add_tokens([unit])

print(f"Vocabulary before training: {tokenizer.get_vocab()}")

# Train the tokenizer
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(f"Vocabulary after training: {tokenizer.get_vocab()}")
# Test the tokenization

test_string = "<RR><QQ>"
encoded = tokenizer.encode(test_string, is_pretokenized=False)
print(f"\nInput: {test_string}")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
# Replacing with base tokenizer because custom pre-tokenizers can't be serialized
tokenizer.pre_tokenizer = BertPreTokenizer()
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
hf_tokenizer.add_special_tokens({"additional_special_tokens": atomic_units})

print(f"\nHuggingFace tokenizer test:")
print(f"Tokens: {hf_tokenizer.tokenize(test_string)}")
print(f"IDs: {hf_tokenizer.encode(test_string)}")

But unfortunately not much luck:

Atomic units regex pattern: <RR>|<QQ>
Vocabulary before training: {'<QQ>': 1, '<RR>': 0}
[00:00:00] Pre-processing sequences       β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 3        /        0Split tokens: ['<RR>', '<QQ>']
Split tokens: ['<RR>', '<QQ>']
Split tokens: ['<RR>', '<QQ>']
[00:00:00] Pre-processing sequences       β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 0        /        0[00:00:00] Tokenize words                 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 2        /        2
[00:00:00] Count pairs                    β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 2        /        2
[00:00:00] Compute merges                 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 6        /        6
Vocabulary after training: {'<RR>': 0, 'Q': 3, '[UNK]': 0, 'R>': 8, 'Q>': 7, '>': 2, 'R': 4, '<': 1, '<R': 6, '<Q': 5, '<QQ>': 1}

Input: <RR><QQ>
Tokens: ['<RR>', '<QQ>']
IDs: [0, 1]

HuggingFace tokenizer test:
Tokens: ['<RR>', '<QQ>']
IDs: [10, 9]

Calling tokenizer.encode doesn't trigger the print statement in the pre-tokenizer; I wonder if it isn't pre-tokenizing the examples properly.
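One probe I might try (just a guess on my part, assuming the tokens registered via add_tokens are extracted before the pre-tokenizer runs):

# Reattach the custom pre-tokenizer, then encode text *around* the units
tokenizer.pre_tokenizer = PreTokenizer.custom(AtomicUnitPreTokenizer(atomic_units))
probe = tokenizer.encode("AB<RR>CD")
# If the guess is right, "Split tokens: ['AB']" and "Split tokens: ['CD']" get
# printed, but the split function never sees '<RR>' itself.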


Hmm… One dirty hack.

hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
hf_tokenizer._tokenizer.pre_tokenizer = AtomicUnitPreTokenizer(atomic_units) # Added
hf_tokenizer.add_special_tokens({"additional_special_tokens": atomic_units})
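Untested usage sketch (reusing the objects defined above):

# Quick smoke test through the HF wrapper
print(hf_tokenizer.tokenize("HKLDCPHY<RR><QQ>PDHGIVMN"))

# Same caveat as before: a Python-level custom pre-tokenizer can't be
# serialized, so a built-in one (e.g. BertPreTokenizer) has to be swapped back
# in before hf_tokenizer.save_pretrained(...).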