I have a strange situation where I’m trying to build a custom tokenizer for a custom “language” (encoded music). The language is designed to represent the data in a way that is compact and also reasonably human-readable. In most cases, the ideal tokenization would be essentially word-level, just separating by whitespace, but because I don’t necessarily know in advance all the possibilities, I want to avoid just forcing word-level tokenization.
What I’m finding is that, if I train a tokenizer using the basic “intro” approach on the Tokenizers doc page, e.g.:
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# collect the corpus files
corpus_root = './content'
paths = [str(x) for x in Path(corpus_root).glob("**/*.txt")]

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=4000)
tokenizer.train(files=paths, trainer=trainer)
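and then do a quick sanity check on the result (the sample string here is just a stand-in for a line pulled from my corpus):
# quick check: for in-vocabulary input, the encoding should come out at
# roughly one token per whitespace-separated unit
sample = "..."  # placeholder for a line from one of the corpus .txt files
enc = tokenizer.encode(sample)
print(enc.tokens, len(sample.split(' ')), len(enc.ids))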
With this, I get pretty much the ideal vocab, and the encoding works as expected. However, this tokenizer works differently from the tokenizers in the Transformers library (e.g., it doesn’t have a tokenizer.vocab attribute, and so on), so using a more “stock” Transformers tokenizer seems like the safer/easier option going forward.
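To illustrate the kind of mismatch I mean (as far as I understand the raw tokenizers API):
# the raw tokenizers.Tokenizer has its own methods for this...
vocab = tokenizer.get_vocab()   # dict of token -> id
print(len(vocab))
# ...but no .vocab attribute, so code written against the Transformers
# tokenizer API fails here:
# print(tokenizer.vocab)        # AttributeError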
As a workaround, I tried saving the tokenizer files using:
tokenizer.model.save('./tokenizer/roberta_tokenizer')
tokenizer.save('./tokenizer/roberta_tokenizer/config.json')
and then loading this into a RobertaTokenizer, using the saved vocab.json and merges.txt files. Although this does function, what I’m noticing is that the lengths of the encoded tokenizations are dramatically different: up to 3x longer for the same input when using RobertaTokenizer over the Tokenizer(BPE.from_file()) version, using the same files, e.g.:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import AutoTokenizer, RobertaTokenizer

# reference encoding with the trained tokenizers.Tokenizer
# ("input" is a sample string in the encoded-music language)
count = len(input.split(' '))
output = tokenizer.encode(input)
print(f'{output.tokens}, count = {count}, encoded count = {len(output.ids)}')

# save vocab.json/merges.txt plus the full tokenizer config
tokenizer.model.save('./tokenizer/roberta_tokenizer')
tokenizer.save('./tokenizer/roberta_tokenizer/config.json')

# load the saved files into a RobertaTokenizer, then into a raw Tokenizer, and compare
roberta_tokenizer = RobertaTokenizer('./tokenizer/roberta_tokenizer/vocab.json', './tokenizer/roberta_tokenizer/merges.txt')
roberta_tokenizer.save_pretrained('./tokenizers/roberta-tokenizer')
test_tok = AutoTokenizer.from_pretrained('./tokenizers/roberta-tokenizer')
test_tok2 = Tokenizer(BPE.from_file('./tokenizer/roberta_tokenizer/vocab.json', './tokenizer/roberta_tokenizer/merges.txt'))

test = test_tok.encode(input)
print("Roberta test: ", test)
test2 = test_tok2.encode(input)
print("Tokenizer(BPE()) test: ", test2.ids)
-------
"Roberta test: [4000, 192, 357, 52, 5, 194, 661, 11, 33, 32, 9, 39, 9, 111, 51, 58, 6, 130, 53, 46, 51, 58, 6, 146, 6, 260, 58, 6, 1372, 7, 584, 7, 1876, 58, 6, 99, 101, 9, 37, 8, 59, 6, 151, 9, 77, 695, 11, 4001]"
"Tokenizer(BPE()) test: [939, 711, 196, 115, 131, 120, 295, 165, 233, 309, 99, 395, 100, 335, 568]"
I tried testing this with a small, natural-language dataset from the Datasets page, and I don’t see these dramatic differences there; the counts only differ by a few tokens, which could be explained by special/added tokens. So I’m assuming the difference comes from my “synthesized” language. Is there a way to use Tokenizer(BPE.from_file()), but still have all the integration features of, for example, RobertaTokenizer from the Transformers library?
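For what it’s worth, the closest thing I’ve found so far is wrapping the saved tokenizer JSON in a PreTrainedTokenizerFast, roughly like the sketch below (the paths and special tokens are just the ones from my setup above, and I’m not certain this is the intended route or that it gives me everything a RobertaTokenizer would):
from transformers import PreTrainedTokenizerFast

# wrap the full tokenizer JSON saved above (the file I called config.json);
# the special-token names are the ones I passed to BpeTrainer
wrapped = PreTrainedTokenizerFast(
    tokenizer_file='./tokenizer/roberta_tokenizer/config.json',
    unk_token='[UNK]', pad_token='[PAD]', cls_token='[CLS]',
    sep_token='[SEP]', mask_token='[MASK]',
)
print(wrapped.vocab_size)
print(wrapped('<some encoded-music string>').input_ids)  # placeholder input
but I don’t know whether that is equivalent to a properly configured RobertaTokenizer, hence the question.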
I have been posting some of my saga on the Discord server, but I think this is probably a better place for something so detailed (and mysterious).
Any help would be greatly appreciated.