I am dealing with a language where each sentence is a sequence of instructions, and each instruction has a character component and a numerical component. The number of possible instructions is known and is finite. There are a few hundred of them. Without getting into the idiosyncrasies of the language I’m actually dealing with, consider the following language:
The instructions are of the form x:y, where x is in ‘ABCD’ and y is in range(100).
Here is a sentence consisting of five instructions:
A:50 B:2 C:12 A:19 D:12.
I think of the instructions as “characters”. This language does not have a natural notion of “words”.
I would like to train a BPE tokenizer on a large dataset of instructions. I don't think I want the tokenizer to split my "characters", but I cannot get any of the trainers to respect that. For instance:
import random
import tokenizers

alphabet = []
second = ':'
for first in 'ABCD':
    for third in range(0, 100):
        alphabet.append(first + second + str(third))
print('alphabet =', alphabet)

text = ''
for i in range(1000):
    first = 'ABCD'[random.randint(0, 3)]
    second = ':'
    third = str(random.randint(0, 99))
    text += first + second + third + ' '
text = text.rstrip(' ')
print('text =', text)

tokenizer = tokenizers.SentencePieceBPETokenizer()
tokenizer.train_from_iterator([text],
                              vocab_size=1000,
                              min_frequency=1,
                              limit_alphabet=500,
                              )
for k in tokenizer.get_vocab():
    print(k)
print('vocab size =', len(tokenizer.get_vocab()))
alphabet = ['A:0', 'A:1', 'A:2', 'A:3', 'A:4', 'A:5', 'A:6', 'A:7', 'A:8', 'A:9', 'A:10', 'A:11', 'A:12', 'A:13', 'A:14', 'A:15', 'A:16',...
text = B:70 D:76 B:82 C:61 A:6 B:73 B:58 C:2 D:60 A:28 C:35 C:85 A:90 A:61 B:84 C:10 C:28 A:36 A:12 A:9 B:48 A:56 B:89 B:44...
▁D:50
▁B:11
▁B:74
▁B:44
▁A:52
▁B:56
▁A:44
▁C:59
▁A:92
▁C:68
▁D:64
▁B:77
▁B:85
▁D:82
7
▁A:51
▁A:8
...
▁A:45
:
vocab size = 400
That won't work because the trainer splits on whitespace before training, so it will never encode more than one instruction per vocabulary token. Let's try replacing the whitespace with semicolons instead:
tokenizer = tokenizers.SentencePieceBPETokenizer()
tokenizer.train_from_iterator([text.replace(' ', ';')],
                              vocab_size=1000,
                              min_frequency=1,
                              limit_alphabet=500,
                              )
for k in tokenizer.get_vocab():
    print(k)
print('vocab size =', len(tokenizer.get_vocab()))
;A:69;B:5
6;C:8
0;B:4
3;B:4
6;D:8
9;A:5
;D:81
;C:64
;B:7
6;B:30
;C:92
;A:95
;C:67
...
vocab size = 1000
That's better because it's generally encoding more than one instruction per vocabulary token now, but notice that the vocabulary contains tokens that encode only part of an instruction (even when other instructions are included in full), like the second one. I'm not sure that's ideal. So: is it possible to tell a tokenizer to never split certain strings while training? I have tried using the "initial_alphabet" and "special_tokens" options of train_from_iterator to accomplish this, with no success. To be clear, I would hope to end up with a vocabulary like:
;D:96;A:3
;B:2
;D:94;A:5;B:17
...
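One workaround I've been considering (a preprocessing idea of my own, not a library feature) is to remap each instruction to a single placeholder character before training, so a character-level BPE trainer can only ever merge whole instructions. A minimal sketch, using Unicode private-use codepoints as placeholders:

```python
# Build the full instruction set: 'A:0' through 'D:99' (400 instructions).
instructions = [f'{first}:{third}' for first in 'ABCD' for third in range(100)]

# Map each instruction to one codepoint in the Unicode private use area,
# so no instruction can be split into smaller pieces by a BPE trainer.
to_char = {ins: chr(0xE000 + i) for i, ins in enumerate(instructions)}
from_char = {c: ins for ins, c in to_char.items()}

def encode(sentence):
    """Replace every whitespace-separated instruction with its placeholder."""
    return ''.join(to_char[ins] for ins in sentence.split(' '))

def decode(s):
    """Map placeholder characters back to the original instructions."""
    return ' '.join(from_char[c] for c in s)

remapped = encode('A:50 B:2 C:12 A:19 D:12')
# `remapped` is now 5 characters, one per instruction; feeding text like
# this to train_from_iterator (with limit_alphabet >= 400) should mean the
# learned merges can only combine whole instructions.
```

The mapping is reversible, so a trained tokenizer's output can be decoded back through `from_char`. Whether this plays well with the rest of a `tokenizers` pipeline is an assumption on my part; the remapping itself is the only part shown here.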