Tokenized sequence lengths

I have a strange situation where I’m trying to build a custom tokenizer for a custom “language” (encoded music). The language is designed to represent the data in a way that is compact and also reasonably human-readable. In most cases, the ideal tokenization would be essentially word-level, just separating by whitespace, but because I don’t necessarily know in advance all the possibilities, I want to avoid just forcing word-level tokenization.

What I’m finding is that, if I train a tokenizer using the basic “intro” approach on the Tokenizers doc page, e.g.:

from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# gather all training files under the corpus root
corpus_root = './content'
paths = [str(x) for x in Path(corpus_root).glob("**/*.txt")]

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=4000)
tokenizer.train(files=paths, trainer=trainer)

then I get pretty much the ideal vocab, and the encoding works as expected. However, this tokenizer behaves differently from the tokenizers in the Transformers library (for example, it has no tokenizer.vocab attribute), so using a more “stock” Transformers tokenizer seems like the safer/easier option going forward.
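To make the mismatch concrete, this is roughly the difference in API surface I'm running into (a sketch; the test string is just a placeholder for a line of my encoded music):

# tokenizers-library Tokenizer: encode() returns an Encoding object,
# and there is no .vocab attribute on the tokenizer itself
enc = tokenizer.encode("placeholder test string")
print(enc.tokens, enc.ids)
print(tokenizer.get_vocab_size())

# a Transformers tokenizer instead exposes .vocab / .vocab_size, and its
# encode() returns a plain list of ids (or dicts of input_ids when called)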

As a workaround, I tried saving the tokenizer files using:

tokenizer.model.save('./tokenizer/roberta_tokenizer')
tokenizer.save('./tokenizer/roberta_tokenizer/config.json')

and then loading this into a RobertaTokenizer, using the saved vocab.json and merges.txt files. Although this does function, what I’m noticing is that the lengths of the encoded tokenizations are dramatically different—up to 3x longer for the same input when using RobertaTokenizer over the Tokenizer(BPE.from_file()) version, using the same files, e.g.:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import AutoTokenizer, RobertaTokenizer

# baseline: encode with the trained tokenizers-library tokenizer
count = len(input.split(' '))
output = tokenizer.encode(input)
print(f'{output.tokens}, count = {count}, encoded count = {len(output.ids)}')

# save vocab.json/merges.txt (the model) and the full tokenizer JSON
tokenizer.model.save('./tokenizer/roberta_tokenizer')
tokenizer.save('./tokenizer/roberta_tokenizer/config.json')

# load the same vocab/merges into a RobertaTokenizer and round-trip it
roberta_tokenizer = RobertaTokenizer('./tokenizer/roberta_tokenizer/vocab.json', './tokenizer/roberta_tokenizer/merges.txt')
roberta_tokenizer.save_pretrained('./tokenizers/roberta-tokenizer')

test_tok = AutoTokenizer.from_pretrained('./tokenizers/roberta-tokenizer')
test_tok2 = Tokenizer(BPE.from_file('./tokenizer/roberta_tokenizer/vocab.json', './tokenizer/roberta_tokenizer/merges.txt'))

test = test_tok.encode(input)
print("Roberta test: ", test)

test2 = test_tok2.encode(input)
print("Tokenizer(BPE()) test: ", test2.ids)

------- 

"Roberta test:  [4000, 192, 357, 52, 5, 194, 661, 11, 33, 32, 9, 39, 9, 111, 51, 58, 6, 130, 53, 46, 51, 58, 6, 146, 6, 260, 58, 6, 1372, 7, 584, 7, 1876, 58, 6, 99, 101, 9, 37, 8, 59, 6, 151, 9, 77, 695, 11, 4001]"
"Tokenizer(BPE()) test:  [939, 711, 196, 115, 131, 120, 295, 165, 233, 309, 99, 395, 100, 335, 568]"

I tried testing this with a small natural-language dataset from the Datasets page, and I don't see these dramatic differences; the counts only differ by a few tokens, which could be explained by special/added tokens. So I'm assuming the difference comes from my “synthesized” language. Is there a way to use Tokenizer(BPE.from_file()) but still get all the integration features of, for example, RobertaTokenizer from the Transformers library?

I have been posting some of my saga on the Discord server, but I think this is probably a better place for something so detailed (and mysterious).

Any help would be greatly appreciated.

Or, alternatively, does anyone know why:

tokenizer = Tokenizer(BPE.from_file('./tokenizer/roberta_tokenizer/vocab.json', './tokenizer/roberta_tokenizer/merges.txt'))
print("vocab_size: ", tokenizer.model.vocab)

fails with an error saying 'tokenizers.models.BPE' object has no attribute 'vocab'? According to the docs (the "Input sequences" page of the tokenizers documentation), it should have one.

According to tokenizers.__version__ I’m running 0.11.0. These docs are for 0.10.0—is vocab removed in 0.11.0? Or is something just borked in my install?

UPDATE: I gave 0.10.1 a try, just for kicks, but same error.
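The Tokenizer-level accessors might be a workable substitute for the missing .vocab attribute (a sketch; I'm assuming get_vocab() and get_vocab_size() are available in your tokenizers version):

tokenizer = Tokenizer(BPE.from_file('./tokenizer/roberta_tokenizer/vocab.json', './tokenizer/roberta_tokenizer/merges.txt'))

# vocab as a {token: id} dict, and its size, queried on the Tokenizer itself
vocab = tokenizer.get_vocab()
print("vocab_size: ", tokenizer.get_vocab_size())
print(list(vocab.items())[:10])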

Digging in further, it looks like the difference must be between BPE and ByteLevelBPETokenizer (i.e., RoBERTa's tokenizer). With the former I get the 4000-item vocab I want, but the latter only gives me a 1300-item vocab (despite my passing vocab_size=4000).
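For completeness, this is roughly how I was training the byte-level version (a reconstruction, not the exact code I ran; paths is the same file list as above):

from tokenizers import ByteLevelBPETokenizer

bl_tokenizer = ByteLevelBPETokenizer()
bl_tokenizer.train(files=paths, vocab_size=4000,
                   special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# this comes back well short of the requested 4000 on my corpus
print("trained vocab size: ", bl_tokenizer.get_vocab_size())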

So to get what I’m after, I have to either:

  1. figure out how to get the BPE version into a tokenizer that plays nice with transformers OR
  2. figure out how to get the ByteLevelBPETokenizer to learn a 4000 item vocab

Okay, I’ve made some progress with this approach:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

train = True

if train:
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

    tokenizer.pre_tokenizer = Whitespace()
    # files is the same list of corpus .txt paths as before
    tokenizer.train(files, trainer)

    tokenizer.save('./tokenizer/bpe_tokenizer/tokenizer.json')

# wrap the saved tokenizers-library tokenizer in a Transformers fast tokenizer
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="./tokenizer/bpe_tokenizer/tokenizer.json")

test = fast_tokenizer.encode(input)
print("test tokenization: ", test)

decode_test = fast_tokenizer.decode(test)
print("test decode: ", decode_test)

Then I modified run_clm.py to use tokenizer = PreTrainedTokenizerFast(tokenizer_file=model_args.model_name_or_path) instead of the original AutoTokenizer version.
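In run_clm.py that amounts to something like the following (a sketch; the surrounding argument handling varies between transformers versions, and pointing model_name_or_path at the saved tokenizer.json is just how I happened to wire it up):

from transformers import PreTrainedTokenizerFast

# replaces the original AutoTokenizer.from_pretrained(...) call in run_clm.py;
# model_args.model_name_or_path points at the saved tokenizer.json here
tokenizer = PreTrainedTokenizerFast(tokenizer_file=model_args.model_name_or_path)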

This now works as expected. Worth noting: you can’t use PreTrainedTokenizer (i.e., the slow version) or you’ll hit a NotImplementedError when trying to call tokenizer.encode(input).

That’s right. Tokenizers from the tokenizers library expose different methods from the tokenizers in the transformers library. You want to wrap yours in PreTrainedTokenizerFast from transformers to get access to the transformers-side functionality.
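Besides loading from a saved tokenizer.json, you can also wrap the in-memory object directly, assuming your transformers version supports the tokenizer_object argument (a sketch):

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# wrap an existing tokenizers-library Tokenizer without saving it first
tok = Tokenizer.from_file('./tokenizer/bpe_tokenizer/tokenizer.json')
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tok)

print(fast_tokenizer.tokenize("some test string"))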

I suspect you want to be working with a character-level BPE rather than a byte-level BPE tokenizer.
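For what it's worth, the setup difference is roughly this (a sketch; file paths and special tokens are placeholders):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace, ByteLevel
from tokenizers.trainers import BpeTrainer

# character-level BPE: starts from the characters of each whitespace-separated word
char_bpe = Tokenizer(BPE(unk_token="[UNK]"))
char_bpe.pre_tokenizer = Whitespace()

# byte-level BPE (GPT-2/RoBERTa style): starts from a byte-level representation instead
byte_bpe = Tokenizer(BPE())
byte_bpe.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(vocab_size=4000, special_tokens=["[UNK]", "[PAD]"])
# char_bpe.train(files, trainer)
# byte_bpe.train(files, trainer)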

I’m also working on encoded music. We should talk :smile:

Yes, that’s what I finally realized, in a kind of roundabout way.

And yes again, we should talk!

btw, Huggingface people, I’m still wondering if there’s any way to force a larger vocabulary during training? Presumably this would just be more “merging”, no? Shouldn’t there be a parameter to force a larger vocab if you want it?

EDIT: I notice I was apparently getting the 4000-word vocab when I posted this, but that’s not the case now… I request vocab_size=4000 and I get 2026. Hmm…
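One way to sanity-check whether the trainer is simply running out of merges (with a Whitespace pre-tokenizer, merges never cross word boundaries, so once every distinct word is a single token there is nothing left to merge and training stops short of the requested vocab_size). A sketch, assuming the same ./content corpus layout as above:

from collections import Counter
from pathlib import Path

words = Counter()
for p in Path('./content').glob("**/*.txt"):
    words.update(p.read_text().split())

alphabet = {ch for w in words for ch in w}

# Final BPE vocab = initial alphabet + one token per merge + special tokens.
# If the corpus has few distinct words, the trainer can exhaust all useful
# merges long before reaching the requested vocab_size.
print("distinct words:", len(words))
print("alphabet size:", len(alphabet))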