I am dealing with a language where each sentence is a sequence of instructions, and each instruction has a character component and a numerical component. The number of possible instructions is known and is finite. There are a few hundred of them. Without getting into the idiosyncrasies of the language I’m actually dealing with, consider the following language:
The instructions are of the form x:y, where x is in ‘ABCD’ and y is in range(100).
Here is a sentence consisting of five instructions:
A:50 B:2 C:12 A:19 D:12.
I think of the instructions as “characters”. This language does not have a natural notion of “words”.
I would like to train a BPE tokenizer on a large dataset of instructions. I don't think I want the tokenizer to split my "characters", but I cannot get any of the trainers to respect that. For instance:
import random
import tokenizers

alphabet = []
second = ':'
for first in 'ABCD':
    for third in range(0, 100):
        alphabet.append(first + second + str(third))
print('alphabet =', alphabet)

text = ''
for i in range(1000):
    first = 'ABCD'[random.randint(0, 3)]
    second = ':'
    third = str(random.randint(0, 99))
    text += first + second + third + ' '
text = text.rstrip(' ')
print('text =', text)

tokenizer = tokenizers.SentencePieceBPETokenizer()
tokenizer.train_from_iterator([text],
                              vocab_size=1000,
                              min_frequency=1,
                              limit_alphabet=500,
                              )
for k in tokenizer.get_vocab():
    print(k)
print('vocab size =', len(tokenizer.get_vocab()))
alphabet = ['A:0', 'A:1', 'A:2', 'A:3', 'A:4', 'A:5', 'A:6', 'A:7', 'A:8', 'A:9', 'A:10', 'A:11', 'A:12', 'A:13', 'A:14', 'A:15', 'A:16',...
text = B:70 D:76 B:82 C:61 A:6 B:73 B:58 C:2 D:60 A:28 C:35 C:85 A:90 A:61 B:84 C:10 C:28 A:36 A:12 A:9 B:48 A:56 B:89 B:44...
▁D:50
▁B:11
▁B:74
▁B:44
▁A:52
▁B:56
▁A:44
▁C:59
▁A:92
▁C:68
▁D:64
▁B:77
▁B:85
▁D:82
7
▁A:51
▁A:8
...
▁A:45
:
vocab size = 400
That won't work because the trainer splits on whitespace before training, so it will never encode more than one instruction per vocabulary token. Let's try replacing the whitespace with semicolons instead:
tokenizer = tokenizers.SentencePieceBPETokenizer()
tokenizer.train_from_iterator([text.replace(' ', ';')],
                              vocab_size=1000,
                              min_frequency=1,
                              limit_alphabet=500,
                              )
for k in tokenizer.get_vocab():
    print(k)
print('vocab size =', len(tokenizer.get_vocab()))
;A:69;B:5
6;C:8
0;B:4
3;B:4
6;D:8
9;A:5
;D:81
;C:64
;B:7
6;B:30
;C:92
;A:95
;C:67
...
vocab size = 1000
That's better because it's generally encoding more than one instruction per vocabulary token now, but notice that the vocabulary contains tokens that encode only part of an instruction (even when other instructions are included in full), like the second one. I'm not sure that's ideal. So: is it possible to tell a tokenizer to never split certain strings while training? I have tried using the "initial_alphabet" and "special_tokens" options of train_from_iterator to accomplish this, with no success. To be clear, I would hope to end up with a vocabulary like:
;D:96;A:3
;B:2
;D:94;A:5;B:17
...
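One workaround I've been considering (a preprocessing idea of my own, not a library feature) is to remap each instruction to a single placeholder character before training, so a character-level BPE trainer can only ever merge whole instructions. A minimal sketch, using Unicode private-use codepoints as placeholders:

```python
# Build the full instruction set: 'A:0' through 'D:99' (400 instructions).
instructions = [f'{first}:{third}' for first in 'ABCD' for third in range(100)]

# Map each instruction to one codepoint in the Unicode private use area,
# so no instruction can be split into smaller pieces by a BPE trainer.
to_char = {ins: chr(0xE000 + i) for i, ins in enumerate(instructions)}
from_char = {c: ins for ins, c in to_char.items()}

def encode(sentence):
    """Replace every whitespace-separated instruction with its placeholder."""
    return ''.join(to_char[ins] for ins in sentence.split(' '))

def decode(s):
    """Map placeholder characters back to the original instructions."""
    return ' '.join(from_char[c] for c in s)

remapped = encode('A:50 B:2 C:12 A:19 D:12')
# `remapped` is now 5 characters, one per instruction; feeding text like
# this to train_from_iterator (with limit_alphabet >= 400) should mean the
# learned merges can only combine whole instructions.
```

The mapping is reversible, so a trained tokenizer's output can be decoded back through `from_char`. Whether this plays well with the rest of a `tokenizers` pipeline is an assumption on my part; the remapping itself is the only part shown here.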