Tokenizing Domain-Specific Text

Hello everyone, I’ve been referencing this paper on training transformer-based models using metadata-enhanced MIDI, and was thinking about implementing it with the Hugging Face transformers and tokenizers libraries as an introduction to these libraries beyond the basic language-modeling examples. While researching and following this tutorial I’ve run into issues with tokenization. When training a tokenizer, how can I set up “word level” semantics? Each “word” in my case is an entire string like ‘Event(name=Position, time=360, value=4/16, text=360)’, rather than words and characters delimited on spaces, which is what it’s doing now, as listed below:
#version: 0.2 - Trained by huggingface/tokenizers
m e
a l
u e
al ue
i me
n ame
v alue
Ġ value
Ġt ex
Ġt ime
Ev en
Ġtex t
Even t
) ,
Ġ Event
N o

Apologies if these questions are noobish; I’m grokking a lot of this as I go along. Any help is greatly appreciated.

pinging @anthony and @Narsil here :slight_smile:


What is your original data like?

The “word level” semantics is usually handled by the pre-tokenizer logic, which splits up the data wherever it’s relevant. In your case, it would depend on your original data. There is more info in the docs.

If you have a specific example of what data comes in and what you expect to get back, it’s probably going to be easier to give you pointers.
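As a sketch of that pre-tokenizer idea, the tokenizers library’s `Split` pre-tokenizer can keep each whole `Event(...)` string from the original post as a single “word”. The regex pattern here is an assumption based on that one example string, not something from the thread:

```python
from tokenizers import Regex, pre_tokenizers

# Assumed pattern: each event is "Event(" ... ")" with no nested parens.
# behavior="isolated" keeps each match as its own piece.
pattern = Regex(r"Event\([^)]*\)")
pre_tok = pre_tokenizers.Split(pattern, behavior="isolated")

sequence = (
    "Event(name=Position, time=360, value=4/16, text=360)"
    "Event(name=Note On, time=360, value=60, text=60)"
)
pieces = pre_tok.pre_tokenize_str(sequence)
print([piece for piece, offsets in pieces])
# Each Event(...) string comes out as one piece instead of being
# split on spaces and punctuation.
```

A model such as `WordLevel` placed behind this pre-tokenizer would then map each whole event string to a single id.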


Hello Narsil, my data looks something like this: Note On_XX, Tempo Value_XX, Chord_Note:type, Position_X/X, where XX can be a number from 0 to 127 and X can be a number from 1 to 16. I also have a dictionary with every possible value for the above text types numericalized. Will I have to write a custom tokenizer to accomplish this tokenization? Is there a way to use my existing numericalized dictionary in a tokenizer? Thank you for your help and for the documentation; I’ve begun going over it.

Yes, it seems your vocabulary is well defined and sufficiently small (roughly 128 * 16 * 16 combinations) to fit in a “WordLevel” tokenizer.

You can even create your vocabulary manually, which makes it easier to run.

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel

# Build the vocabulary manually: one id per composite event "word"
vocab = {"[UNK]": 0}
i = 1
for value in range(0, 128):
    for position in range(0, 16):
        word = f"Note_On_{value},Position_{position}"
        vocab[word] = i
        i += 1

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
# Split on whitespace only, so each "Note_On_12,Position_4" stays one token
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

encoded = tokenizer.encode("Note_On_12,Position_4 Note_On_45,Position_8")
encoded.ids  # [197, 729]

The vocabulary creation I wrote seems a bit off compared to what you described, but it should be close enough.

I assumed the number of possible values is relatively small (< 50,000): 128 notes * 16 tempos * 16 values makes about 32k combinations, being conservative. If there are actually an order of magnitude more, then this approach won’t work anymore.
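If the combined vocabulary does grow too large, one alternative (a sketch, not something from this thread; the `Note_On_` / `Position_` word format is an assumption) is to split each composite event into its fields with an extra pre-tokenizer, so the vocabulary only needs one entry per field value (128 + 16 + ... entries, a sum instead of a product):

```python
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel

# One vocabulary entry per field value, not per combination
vocab = {"[UNK]": 0}
for value in range(128):
    vocab[f"Note_On_{value}"] = len(vocab)
for position in range(16):
    vocab[f"Position_{position}"] = len(vocab)

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
# First split on whitespace, then split each piece on commas,
# dropping the commas themselves
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Split(",", behavior="removed"),
])

encoded = tokenizer.encode("Note_On_12,Position_4 Note_On_45,Position_8")
print(encoded.tokens)
# Four field tokens instead of two composite ones
```

The trade-off is longer token sequences for the model, in exchange for a vocabulary that stays small however many fields are combined.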

Does that help?


Thank you this is incredibly helpful.