Yes, what you want is a BPE tokenizer with added pre-tokenized atomic substrings that:

- are treated as indivisible units (never split at any point),
- can be merged with surrounding tokens (e.g. forming PHY<RR>),
- are multi-character, but always considered atomic,
- stay interpretable (not replaced with Unicode hacks or private-use glyphs).

Here is how to do this cleanly with modern tokenization tools, particularly the Hugging Face tokenizers library (Rust-backed and flexible). The approach reserves a vocabulary of predefined atomic substrings and learns BPE merges on top of them.

**Step-by-Step Implementation with tokenizers**

**Define Your Atomic Units**

```python
atomic_units = ["<RR>", "<QQ>", "HKLDCPHY"]
```

These must never be split, even though they are multi-character.

**Pre-tokenize Your Corpus with Atomic Units as Special Tokens**
Before training, wrap your corpus in a custom pre-tokenizer that:

- scans the input for atomic units, and
- emits them as single pre-tokens before BPE starts learning merges.

```python
import re
from tokenizers.pre_tokenizers import PreTokenizer


class AtomicUnitPreTokenizer:
    """Splits text so that each atomic unit becomes its own pre-token."""

    def __init__(self, atomic_units):
        # Longest-first so overlapping units resolve to the longest match
        self.atomic_units = sorted(atomic_units, key=len, reverse=True)
        self.pattern = re.compile("|".join(re.escape(a) for a in self.atomic_units))

    def _split(self, i, normalized):
        # Cut the string around every atomic unit, keeping each unit as an
        # isolated piece that BPE can never split further
        text = str(normalized)
        pieces, last = [], 0
        for match in self.pattern.finditer(text):
            if match.start() > last:
                pieces.append(normalized[last:match.start()])
            pieces.append(normalized[match.start():match.end()])
            last = match.end()
        if last < len(text):
            pieces.append(normalized[last:])
        return pieces

    def pre_tokenize(self, pretok):
        # Entry point called by the tokenizers pipeline with a PreTokenizedString
        pretok.split(self._split)
```

Wrap an instance with PreTokenizer.custom to plug it into the Hugging Face tokenizer pipeline, as in the spot-check and training code below.
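A quick way to spot-check the splitting before any training is to call pre_tokenize_str on the wrapped pre-tokenizer. The output shown in the comment is a sketch of the expected shape, assuming `<RR>` and `<QQ>` are your atomic units:

```python
pre = PreTokenizer.custom(AtomicUnitPreTokenizer(["<RR>", "<QQ>"]))
print(pre.pre_tokenize_str("HKLDCPHY<RR><QQ>PDHGIVMN"))
# Expected shape: [('HKLDCPHY', (0, 8)), ('<RR>', (8, 12)), ('<QQ>', (12, 16)), ('PDHGIVMN', (16, 24))]
```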
**Initialize and Train the Tokenizer**

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import PreTokenizer
from tokenizers.trainers import BpeTrainer

# Your atomic units
atomic_units = ["<RR>", "<QQ>"]

# Tokenizer with the custom pre-tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = PreTokenizer.custom(AtomicUnitPreTokenizer(atomic_units))

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"] + atomic_units)

# Corpus containing your sequences
corpus = ["HKLDCPHY<RR><QQ>PDHGIVMN", "ABCDE"]
tokenizer.train_from_iterator(corpus, trainer=trainer)
```

**Test the Tokenization**
```python
encoded = tokenizer.encode("HKLDCPHY<RR><QQ>PDHGIVMN")
print(encoded.tokens)
```
Good output:

```python
["HKLDCPHY", "<RR>", "<QQ>", "PDHGIVMN"]
```

Or even a finer split such as `[..., "PHY", "<RR>", "<QQ>", "PDH", ...]`, as long as `<RR>` and `<QQ>` are never split.

Bad output:

```python
["<", "RR", ">", "<", "QQ", ">"]
```
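If you would rather not eyeball token lists, a quick programmatic check along these lines (a sketch reusing `tokenizer` and `atomic_units` from the training step) confirms the units stayed atomic:

```python
for unit in atomic_units:
    # Every atomic unit should exist as a single vocabulary entry...
    assert tokenizer.token_to_id(unit) is not None, f"{unit} missing from vocab"
    # ...and encoding it on its own should yield exactly one token
    assert tokenizer.encode(unit).tokens == [unit]
```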
That should force atomic tokens into the model without hacking the Unicode space or losing interpretability. The key is treating `<RR>` and the other atomic units as pre-tokenized units before training. You don't need to substitute weird ASCII or Unicode characters; you're just teaching the tokenizer to respect these strings and treat them as atomic substrings.
This integrates well with Hugging Face transformers via PreTrainedTokenizerFast.

**Add to Special Tokens**

If you want `<RR>`, `<QQ>`, etc. to be treated as special tokens later (e.g. for generation), you can:
```python
tokenizer.add_special_tokens(["<RR>", "<QQ>"])
```
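Once they are registered as special tokens, you can keep or drop them at decode time. A minimal sketch, reusing the tokenizer trained above:

```python
ids = tokenizer.encode("HKLDCPHY<RR><QQ>PDHGIVMN").ids
print(tokenizer.decode(ids, skip_special_tokens=False))  # keeps <RR> and <QQ>
print(tokenizer.decode(ids, skip_special_tokens=True))   # drops them
```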
And if you just want to convert it to a transformers tokenizer:
```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
hf_tokenizer.add_special_tokens({"additional_special_tokens": atomic_units})
```
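Assuming the conversion goes through in your environment, the wrapped tokenizer is then used like any other fast tokenizer. A small sketch with the same example sequence:

```python
enc = hf_tokenizer("HKLDCPHY<RR><QQ>PDHGIVMN")
print(hf_tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ["HKLDCPHY", "<RR>", "<QQ>", "PDHGIVMN"]
```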
You don't need to encode them into Unicode glyphs or sacrifice interpretability. You just need:

- a custom pre-tokenizer that isolates the atomic units,
- a BPE tokenizer trained on top of those preserved pre-tokens, and
- (optionally) the units added to special_tokens for further control.