Tokenizing Domain-Specific Text

Hello everyone, I've been referencing this paper on training transformer-based models using metadata-enhanced MIDI, and I was thinking about implementing it with the Hugging Face transformers and tokenizers libraries as an introduction to these libraries beyond the basic language-modeling examples. As I've been researching and referencing this tutorial, I've run into issues with tokenization. When training a tokenizer, how can I set up "word level" semantics? Each "word" in this case should be the full data within a string like 'Event(name=Position, time=360, value=4/16, text=360)', rather than words and characters delimited on spaces, which is what it's doing now, as listed below:
#version: 0.2 - Trained by huggingface/tokenizers
m e
a l
u e
al ue
i me
n ame
v alue
Ġ value
Ġt ex
Ġt ime
Ev en
Ġtex t
Even t
) ,
Ġ Event
N o

Apologies if these questions are noobish; I'm grokking a lot of this as I go along. Any help is greatly appreciated.


pinging @anthony and @Narsil here :slight_smile:


What is your original data like?

The "word level" semantics is usually handled by the pre-tokenizer logic (which basically splits up the data where it's relevant). In your case, it would depend on your original data. There is more info in the docs.
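For instance, here is a rough sketch of a custom pre-tokenizer. It assumes your corpus is a flat string of Event(...) entries like the one you quoted, so the regex below is an assumption about your format rather than something I can confirm. A Split pre-tokenizer with invert=True keeps each full Event(...) match as a single "word" and drops the text in between:

from tokenizers import Regex, pre_tokenizers

# Assumed format: Event(...) entries separated by whitespace.
event_pattern = Regex(r"Event\([^)]*\)")
pre_tok = pre_tokenizers.Split(event_pattern, behavior="removed", invert=True)

pieces = pre_tok.pre_tokenize_str(
    "Event(name=Position, time=360, value=4/16, text=360) "
    "Event(name=Note On, time=360, value=60, text=60)"
)
# Each Event(...) string comes back as one piece, together with its offsets.
print(pieces)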

If you have a specific example of what data comes in, and what you expect as a return, it's probably going to be easier to give you pointers.

Cheers,
Nicolas

Hello Narsil, my data looks something like this: Note On_XX, Tempo Value_XX, Chord_Note:type, Position_X/X, where XX can be a number from 0 to 127 and X can be a number from 1 to 16. I also have a dictionary with every possible value for the above text types numericalized. Will I have to write a custom tokenizer in order to accomplish this tokenization? Is there a way to use my existing numericalized dictionary in a tokenizer? Thank you for your help and for the documentation; I've begun going over it.

Yes, it seems your vocabulary is well defined and sufficiently small (128 * 16 * 16) to fit in a "WordLevel" tokenizer.

You can even create your vocabulary manually, which makes it easier to get running.

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel

# Enumerate every possible "word" up front (note/position combinations here).
vocab = {"[UNK]": 0}
i = 1
for value in range(0, 128):
    for position in range(0, 16):
        word = f"Note_On_{value},Position_{position}"
        vocab[word] = i
        i += 1

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
# Split on whitespace only, so each event string stays a single "word".
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

encoded = tokenizer.encode("Note_On_12,Position_15 Note_On_45,Position_8")
encoded.ids  # [208, 729]

The vocabulary creation I wrote seems a bit off compared to what you described, but it should be close enough.

I assumed the number of possible values is relatively small (< 50,000): 128 notes × 16 tempos × 16 values makes about 32k possible combinations, being conservative. If there's actually an order of magnitude more, then this approach won't work anymore.
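And about your existing numericalized dictionary: you shouldn't need to rebuild it, you can hand it to the model directly. A sketch, assuming it's a plain Python dict mapping each event string to an integer id, and assuming events are comma-separated in your text (adjust the split pattern to whatever your real delimiter is):

from tokenizers import Tokenizer, Regex, pre_tokenizers
from tokenizers.models import WordLevel

# Assumption: `my_vocab` stands in for your existing numericalized dictionary.
my_vocab = {"Note On_60": 0, "Tempo Value_120": 1, "Position_4/16": 2}
my_vocab["[UNK]"] = len(my_vocab)  # WordLevel needs the unk token in the vocab

tokenizer = Tokenizer(WordLevel(vocab=my_vocab, unk_token="[UNK]"))

# The pre-tokenizer has to cut the input into pieces that match the keys
# exactly; since your events contain spaces, split on the comma delimiter.
tokenizer.pre_tokenizer = pre_tokenizers.Split(Regex(r",\s*"), behavior="removed")

encoded = tokenizer.encode("Note On_60, Position_4/16, Tempo Value_99")
print(encoded.tokens)  # ['Note On_60', 'Position_4/16', '[UNK]']
print(encoded.ids)     # [0, 2, 3]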

Does that help?


Thank you, this is incredibly helpful.