I am dealing with a language where each sentence is a sequence of instructions, and each instruction has a character component and a numerical component. The set of possible instructions is finite and known; there are a few hundred of them. Without getting into the idiosyncrasies of the language I'm actually dealing with, consider the following language:
The instructions are of the form x:y, where x is in 'ABCD' and y is in range(100).
Here is a sentence consisting of five instructions:
A:50 B:2 C:12 A:19 D:12.
I think of the instructions as 'characters'. This language does not have a natural notion of 'words'.
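For concreteness, a quick sketch of the instruction grammar (the regex is my own, just for illustration):

import re

# One instruction: a letter from 'ABCD', a colon, and an integer in range(100).
INSTRUCTION = re.compile(r'[ABCD]:[1-9]?[0-9]')

assert INSTRUCTION.fullmatch('A:50')
assert not INSTRUCTION.fullmatch('E:50')   # letter outside 'ABCD'
assert not INSTRUCTION.fullmatch('A:100')  # number outside range(100)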
I would like to train a BPE tokenizer on a large dataset of instructions. My intuition is that I do not want the tokenizer to split my 'characters', but I cannot get any of the trainers to behave that way. For instance:
import random

import tokenizers

# Build the full instruction alphabet: 'A:0' through 'D:99' (400 instructions).
alphabet = []
second = ':'
for first in 'ABCD':
    for third in range(0, 100):
        alphabet.append(first + second + str(third))
print('alphabet =', alphabet)

# Generate a training text of 1000 random space-separated instructions.
text = ''
for i in range(1000):
    first = 'ABCD'[random.randint(0, 3)]
    second = ':'
    third = str(random.randint(0, 99))
    text += first + second + third + ' '
text = text.rstrip(' ')
print('text =', text)

tokenizer = tokenizers.SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    [text],
    vocab_size=1000,
    min_frequency=1,
    limit_alphabet=500,
)
for k in tokenizer.get_vocab():
    print(k)
print('vocab size =', len(tokenizer.get_vocab()))
Output:
alphabet = ['A:0', 'A:1', 'A:2', 'A:3', 'A:4', 'A:5', 'A:6', 'A:7', 'A:8', 'A:9', 'A:10', 'A:11', 'A:12', 'A:13', 'A:14', 'A:15', 'A:16',...
text = B:70 D:76 B:82 C:61 A:6 B:73 B:58 C:2 D:60 A:28 C:35 C:85 A:90 A:61 B:84 C:10 C:28 A:36 A:12 A:9 B:48 A:56 B:89 B:44...
▁D:50
▁B:11
▁B:74
▁B:44
▁A:52
▁B:56
▁A:44
▁C:59
▁A:92
▁C:68
▁D:64
▁B:77
▁B:85
▁D:82
7
▁A:51
▁A:8
...
▁A:45
:
vocab size = 400
That won't work: the trainer pre-tokenizes on whitespace before it learns any merges, so no vocabulary token can ever span more than one instruction.
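To see the split directly, we can run the pre-tokenization step on its own. This is a sketch assuming SentencePieceBPETokenizer's default Metaspace pre-tokenizer from tokenizers.pre_tokenizers (the exact offsets may differ):

from tokenizers import pre_tokenizers

# Metaspace replaces whitespace with '▁' and splits there, so the BPE
# trainer only ever sees one instruction at a time.
pre = pre_tokenizers.Metaspace()
print(pre.pre_tokenize_str('A:50 B:2 C:12'))
# Something like: [('▁A:50', (0, 4)), ('▁B:2', (5, 8)), ('▁C:12', (9, 13))]

Since the split happens on whitespace, let's try replacing the whitespaces with semicolons instead.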
tokenizer = tokenizers.SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    [text.replace(' ', ';')],  # no whitespace left for the pre-tokenizer to split on
    vocab_size=1000,
    min_frequency=1,
    limit_alphabet=500,
)
for k in tokenizer.get_vocab():
    print(k)
print('vocab size =', len(tokenizer.get_vocab()))
Output:
;A:69;B:5
6;C:8
0;B:4
3;B:4
6;D:8
9;A:5
;D:81;C:64;B:7
6;B:30;C:92;A:95;C:67
...
vocab size = 1000
That's better: it now generally encodes more than one instruction per vocabulary token. But notice that the vocabulary contains tokens that encode only part of an instruction (even when other instructions are included in full), like the second one. I'm not sure that's ideal, so: is it possible to tell a tokenizer to never split certain strings while training? I have tried the initial_alphabet and special_tokens options of train_from_iterator to accomplish this, with no success (a sketch of what I tried follows the example below). To be clear, I would hope to end up with a vocabulary like:
;D:96;A:3
;B:2
;D:94;A:5;B:17
...
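For reference, this is the kind of attempt I mean. It is only a sketch of what I tried (I tried the two options separately as well); if I read the docs right, multi-character strings passed to initial_alphabet are truncated to their first character, which may be why it has no effect:

tokenizer = tokenizers.SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    [text.replace(' ', ';')],
    vocab_size=1000,
    min_frequency=1,
    limit_alphabet=500,
    # Tried seeding every full instruction as an atomic symbol:
    initial_alphabet=alphabet,
    # Also tried protecting the instructions as special tokens:
    special_tokens=['<unk>'] + alphabet,
)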