Adding atomic / indivisible tokens to BPE tokenizer

Hi all,

I am trying to create a BPE tokenizer that has added "atomic" tokens before training, meaning words of length > 1 that should always be tokenized as one indivisible unit (but possibly merged with other words).

For example, my input text can look like: "HKLDCPHY<RR><QQ>PDHGIVMN", where <RR> and <QQ> must be treated essentially as single characters. After tokenization they can be their own tokens or merged into something like PHY<RR> or <RR><QQ>, but they should never be split (something like ["PHY", "<RR", "><", "QQ", ">", "PDH", …] is bad, for example).

Is there a way to do this?

I would encode the important substrings as single characters, but I have a lot of them, and past the ASCII range I sometimes get "Â" appended to tokens containing the rest of the characters; plus, I really want the tokens to be interpretable.

Thanks in advance.


Special tokens seem to be treated as atomic. However, the implementation of special tokens is quite complex (it has been revised and changed over a long period of time), so it is safer to verify the current behavior as you work.
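For example, a minimal check (using a pretrained GPT-2 tokenizer purely for illustration; any fast tokenizer should behave the same way):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# Register the multi-character units as additional special tokens
tok.add_special_tokens({"additional_special_tokens": ["<RR>", "<QQ>"]})

print(tok.tokenize("HKLDCPHY<RR><QQ>PDHGIVMN"))
# '<RR>' and '<QQ>' come out as single, unsplit tokens; the surrounding text
# is tokenized as usual.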

If I add atomic tokens as special ones, will it stop them from being split?


Yes. What you want is a BPE tokenizer with added pre-tokenized atomic substrings that:

- are treated as indivisible units (never split at any point),
- can be merged with surrounding tokens (e.g. PHY<RR>),
- are multi-character, but always considered atomic,
- stay interpretable (not replaced with Unicode hacks or private-use glyphs).

Here is how to do this cleanly with modern tokenization tools, particularly Hugging Face tokenizers (Rust-backed and flexible). The idea: reserve a vocabulary of predefined atomic substrings, then learn merges on top of those atomic substrings.

Step-by-step implementation with tokenizers:

1. Define your atomic units.

atomic_units = ["<RR>", "<QQ>", "<PP>", "<XX>", "HKLDCPHY"]

These must never be split, even though they are multi-character.

2. Pre-tokenize your corpus with atomic units as special tokens.

Before training, wrap your corpus in a custom pre-tokenizer that:

- scans the input for atomic units,
- inserts them as single "tokens" before BPE starts learning.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders
from tokenizers.pre_tokenizers import PreTokenizer
import re

class AtomicUnitPreTokenizer:
    def __init__(self, atomic_units):
        self.atomic_units = sorted(atomic_units, key=len, reverse=True)
        self.pattern = re.compile("|".join(re.escape(a) for a in self.atomic_units))

    def pre_tokenize_str(self, input_str):
        def _split(match):
            # Return the atomic unit as its own token
            return f" {match.group(0)} "

        # Inject spaces around atomic units so tokenizers can treat them as isolated
        split_str = self.pattern.sub(_split, input_str)
        # Then split on whitespace
        return [(token, (0, 0)) for token in split_str.strip().split()]
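To sanity-check the pre-tokenizer on its own before wiring it in (a small usage sketch; note this simplified version returns dummy (0, 0) offsets for every piece):

units = ["<RR>", "<QQ>"]
print(AtomicUnitPreTokenizer(units).pre_tokenize_str("HKLDCPHY<RR><QQ>PDHGIVMN"))
# [('HKLDCPHY', (0, 0)), ('<RR>', (0, 0)), ('<QQ>', (0, 0)), ('PDHGIVMN', (0, 0))]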

Plug this into Hugging Face's tokenizer pipeline.

3. Initialize and train the tokenizer.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Your atomic units
atomic_units = ["<RR>", "<QQ>", "<PP>", "<XX>"]

# Tokenizer with custom pre-tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = AtomicUnitPreTokenizer(atomic_units)

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"] + atomic_units)

# Corpus containing your sequences
corpus = ["HKLDCPHY<RR><QQ>PDHGIVMN", "ABCDE<PP><XX>"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

4. Test the tokenization.
encoded = tokenizer.encode("HKLDCPHY<RR><QQ>PDHGIVMN")
print(encoded.tokens)
Good output:

['HKLDCPHY', '<RR>', '<QQ>', 'PDHGIVMN']

Or even: ['PHY', '<RR>', '<QQ>', 'PDH'], as long as <RR> and <QQ> aren't split.

Bad output:

['<', 'RR', '>', '<', 'QQ', '>']
That should successfully force atomic tokens into the model without hacking the Unicode space or losing interpretability. The key is treating <RR> and the others as "pre-tokenized" units before training. You don't need to substitute weird ASCII or Unicode characters; you're just teaching the tokenizer to respect them and treat them as atomic substrings.

This integrates well with Hugging Face transformers via PreTrainedTokenizerFast.

5. Add to special tokens.

If you want <RR>, <QQ>, etc. to be treated as special tokens later (e.g. for generation), you can:

tokenizer.add_special_tokens(["<RR>", "<QQ>", "<PP>", "<XX>"])

If you just want to convert to a transformers tokenizer:

from transformers import PreTrainedTokenizerFast
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
hf_tokenizer.add_special_tokens({"additional_special_tokens": atomic_units})

You don't need to encode them into Unicode glyphs or sacrifice interpretability. You just need:

- a custom pre-tokenizer that isolates the atomic units,
- a BPE tokenizer trained on top of those preserved tokens,
- optionally, special_tokens for further control.


Thanks @John6666,

Adding these as special tokens works to keep them from fragmenting, but I would actually like these to participate in merges, and special tokens don’t do that. Is there a way to achieve this outcome?
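For example (a quick illustration with a pretrained GPT-2 tokenizer, just to show the behavior I mean):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.add_special_tokens({"additional_special_tokens": ["<RR>"]})

print(tok.tokenize("PHY<RR>"))
# '<RR>' is split out before BPE runs, so it always stays its own token;
# a merged token like 'PHY<RR>' can never be produced this way.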


Thank you @NuralNexus!

I tried to recreate your code because it's a bit mangled, and this is what I have:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders
from tokenizers.pre_tokenizers import PreTokenizer
import re


class AtomicUnitPreTokenizer:
    def __init__(self, atomic_units):
        self.atomic_units = sorted(atomic_units, key=len, reverse=True)
        # Create regex pattern that matches any of the atomic units
        self.pattern = re.compile("|".join(re.escape(a) for a in self.atomic_units))

    def pre_tokenize_str(self, input_str):
        tokens = []
        last_end = 0

        # Find all atomic units in the string
        for match in self.pattern.finditer(input_str):
            start, end = match.span()

            # Add any text before the atomic unit
            if start > last_end:
                before_text = input_str[last_end:start]
                if before_text:
                    tokens.append((before_text, (last_end, start)))

            # Add the atomic unit as a single token
            tokens.append((match.group(0), (start, end)))
            last_end = end

        # Add any remaining text after the last atomic unit
        if last_end < len(input_str):
            remaining_text = input_str[last_end:]
            if remaining_text:
                tokens.append((remaining_text, (last_end, len(input_str))))

        return tokens


# Define your atomic units - these will never be split
atomic_units = ["<RR>", "<QQ>", "<PP>", "<XX>"]

# Initialize tokenizer with BPE model
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Set up the pre-tokenizer to handle atomic units
tokenizer.pre_tokenizer = AtomicUnitPreTokenizer(atomic_units)

# Set up the trainer
trainer = trainers.BpeTrainer(
    vocab_size=1000, special_tokens=["[UNK]"] + atomic_units, min_frequency=2
)

# Example corpus containing your sequences
corpus = [
    "HKLDCPHY<RR><QQ>PDHGIVMN",
    "ABCDE<PP><XX>FGHIJK",
    "MNOP<RR>QRST<QQ>UVWX",
    "YZAB<PP>CDEF<XX>GHIJ",
]

corpus += ["<QQ><RR>"] * 100

# Train the tokenizer
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Test the tokenization
test_string = "HKLDCPHY<RR><QQ>PDHGIVMN"
encoded = tokenizer.encode(test_string)
print(f"Input: {test_string}")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")

# Test with another example
test_string2 = "ABC<PP><XX>DEF<RR>GHI"
encoded2 = tokenizer.encode(test_string2)
print(f"\nInput: {test_string2}")
print(f"Tokens: {encoded2.tokens}")
print(f"IDs: {encoded2.ids}")

# Optional: Convert to HuggingFace format for use with transformers
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
hf_tokenizer.add_special_tokens({"additional_special_tokens": atomic_units})

print(f"\nHuggingFace tokenizer test:")
print(f"Tokens: {hf_tokenizer.tokenize(test_string)}")
print(f"IDs: {hf_tokenizer.encode(test_string)}")

First of all is this what you intended?

I had actually already tried something similar, but I ran into an error, which I'm hitting again with this code:

TypeError: argument 'pretok': 'AtomicUnitPreTokenizer' object cannot be converted to 'PreTokenizer'

I tried debugging it but haven't had much luck. Any idea how to progress?


Hmm… Like this?

#tokenizer.pre_tokenizer = AtomicUnitPreTokenizer(atomic_units)
tokenizer.pre_tokenizer = PreTokenizer.custom(AtomicUnitPreTokenizer(atomic_units))

Thanks @John6666, that did get me closer!

I tried:

from tokenizers import Tokenizer, models, trainers, PreTokenizedString, NormalizedString
from tokenizers.pre_tokenizers import PreTokenizer, BertPreTokenizer
import re


class AtomicUnitPreTokenizer:
    def __init__(self, atomic_units):
        self.atomic_units = sorted(atomic_units, key=len, reverse=True)
        # Create regex pattern that matches any of the atomic units
        self.pattern = re.compile("|".join(re.escape(a) for a in self.atomic_units))
        print(f"Atomic units regex pattern: {self.pattern.pattern}")

    def split(self, _, norm: NormalizedString):
        # split the input string into tokens wrapped in < xx > and everything else
        tokens = []
        start = 0
        input_str = str(norm)
        for match in self.pattern.finditer(input_str):
            # Add everything before the match as a token
            if start < match.start():
                tokens.append(input_str[start : match.start()])
            # Add the matched atomic unit as a token
            tokens.append(input_str[match.start() : match.end()])
            start = match.end()
        # Add any remaining part of the string as a token
        if start < len(input_str):
            tokens.append(input_str[start:])

        print(f"Split tokens: {tokens}")
        return [NormalizedString(token) for token in tokens]

    def pre_tokenize(self, pre_tokenized: PreTokenizedString):
        return pre_tokenized.split(self.split)


# Define your atomic units - these will never be split
atomic_units = ["<RR>", "<QQ>"]
# Initialize tokenizer with BPE model
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Set up the pre-tokenizer to handle atomic units
tokenizer.pre_tokenizer = PreTokenizer.custom(AtomicUnitPreTokenizer(atomic_units))

# Set up the trainer
trainer = trainers.BpeTrainer(
    vocab_size=1000, special_tokens=["[UNK]"], min_frequency=1
)

# Example corpus containing your sequences
corpus = ["<RR><QQ>" for _ in range(3)]

# add the atomic units as regular tokens
for unit in atomic_units:
    tokenizer.add_tokens([unit])

print(f"Vocabulary before training: {tokenizer.get_vocab()}")

# Train the tokenizer
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(f"Vocabulary after training: {tokenizer.get_vocab()}")
# Test the tokenization

test_string = "<RR><QQ>"
encoded = tokenizer.encode(test_string, is_pretokenized=False)
print(f"\nInput: {test_string}")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
# Replacing with base tokenizer because custom pre-tokenizers can't be serialized
tokenizer.pre_tokenizer = BertPreTokenizer()
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
hf_tokenizer.add_special_tokens({"additional_special_tokens": atomic_units})

print(f"\nHuggingFace tokenizer test:")
print(f"Tokens: {hf_tokenizer.tokenize(test_string)}")
print(f"IDs: {hf_tokenizer.encode(test_string)}")

But unfortunately not much luck:

Atomic units regex pattern: <RR>|<QQ>
Vocabulary before training: {'<QQ>': 1, '<RR>': 0}
[00:00:00] Pre-processing sequences       β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 3        /        0Split tokens: ['<RR>', '<QQ>']
Split tokens: ['<RR>', '<QQ>']
Split tokens: ['<RR>', '<QQ>']
[00:00:00] Pre-processing sequences       β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 0        /        0[00:00:00] Tokenize words                 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 2        /        2
[00:00:00] Count pairs                    β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 2        /        2
[00:00:00] Compute merges                 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 6        /        6
Vocabulary after training: {'<RR>': 0, 'Q': 3, '[UNK]': 0, 'R>': 8, 'Q>': 7, '>': 2, 'R': 4, '<': 1, '<R': 6, '<Q': 5, '<QQ>': 1}

Input: <RR><QQ>
Tokens: ['<RR>', '<QQ>']
IDs: [0, 1]

HuggingFace tokenizer test:
Tokens: ['<RR>', '<QQ>']
IDs: [10, 9]

Calling tokenizer.encode doesn't trigger the print statement in the pre-tokenizer; I wonder if it isn't pre-tokenizing the examples properly.
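One probe I might try (just a guess on my part, assuming the tokens registered via add_tokens are extracted before the pre-tokenizer runs):

# Reattach the custom pre-tokenizer, then encode text *around* the units
tokenizer.pre_tokenizer = PreTokenizer.custom(AtomicUnitPreTokenizer(atomic_units))
probe = tokenizer.encode("AB<RR>CD")
# If the guess is right, "Split tokens: ['AB']" and "Split tokens: ['CD']" get
# printed, but the split function never sees '<RR>' itself.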


Hmm… One dirty hack.

hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
hf_tokenizer._tokenizer.pre_tokenizer = AtomicUnitPreTokenizer(atomic_units) # Added
hf_tokenizer.add_special_tokens({"additional_special_tokens": atomic_units})
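Untested usage sketch (reusing the objects defined above):

# Quick smoke test through the HF wrapper
print(hf_tokenizer.tokenize("HKLDCPHY<RR><QQ>PDHGIVMN"))

# Same caveat as before: a Python-level custom pre-tokenizer can't be
# serialized, so a built-in one (e.g. BertPreTokenizer) has to be swapped back
# in before hf_tokenizer.save_pretrained(...).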