Hello everyone, I've been referencing this paper on training transformer-based models using metadata-enhanced MIDI and was thinking about implementing it with the Hugging Face transformers and tokenizers libraries, as an introduction to these libraries beyond the basic language modeling examples. As I've been researching and following this tutorial I've run into issues with tokenization, and was wondering: when training a tokenizer, how can I set up "word level" semantics? Technically, each "word" in this case will be the data within a string like "Event(name=Position, time=360, value=4/16, text=360)" rather than just words and characters delimited on spaces, like it's doing now, as listed below (from the trained merges file):

#version: 0.2 - Trained by huggingface/tokenizers
m e a l u e al ue i me n ame v alue Ġvalue Ġt ex Ġt ime Ev en Ġtex t Even t ) , ĠEvent N o
Apologies if these questions are noobish, I'm grokking a lot of this as I go along. Any help is greatly appreciated.
The "word level" semantics is usually dealt with by the pre-tokenizer logic (which basically splits up the data wherever it's relevant). In your case, it would depend on your original data. There is more info in the docs:
If you have a specific example of what data comes in, and what you expect as a return, it's probably going to be easier to give you pointers.
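For instance, a Split pre-tokenizer with a regex matching one whole event could keep each "Event(...)" string together as a single "word" instead of splitting on spaces. Here is a minimal sketch; the regex and the sample strings are just assumptions about the exact format of your data:

from tokenizers import Regex, pre_tokenizers

# Keep each whole "Event(...)" string together as one piece instead of
# splitting on spaces; the regex is only a guess at the exact event format.
pre_tok = pre_tokenizers.Split(pattern=Regex(r"Event\([^)]*\)"), behavior="isolated")

pieces = pre_tok.pre_tokenize_str(
    "Event(name=Position, time=360, value=4/16, text=360) "
    "Event(name=Note On, time=360, value=60, text=60)"
)
# Each event comes back as its own (string, offsets) piece; the whitespace
# between events also shows up as a piece, which you can strip out before
# building the vocabulary.
print(pieces)

Whatever pre-tokenizer you choose, the model (WordLevel, BPE, etc.) then only ever sees those pieces, so that is where the "word" boundaries are decided.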
Hello Narsil, my data looks something like this: Note On_XX, Tempo Value_XX, Chord_Note:type, Position_X/X, where XX can be a number from 0 to 127 and X can be a number from 1 to 16. I also have a dictionary with every possible value for the above text types numericalized. Will I have to write a custom tokenizer in order to accomplish this tokenization? Is there a way to use my existing numericalized dictionary in a tokenizer? Thank you for your help and the documentation, I've begun going over it.
Yes, it seems your vocabulary is well defined and sufficiently small (128 * 16 * 16) to fit in a "WordLevel" tokenizer.
You can even create your vocabulary manually, which makes it easier to run.
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import BPE

# Enumerate every possible "word" up front and give each one its own id.
vocab = {}
i = 0
for value in range(0, 128):
    for position in range(0, 16):
        word = f"Note On_{value},Position_{position}"
        vocab[word] = i
        i += 1

# BPE needs a merges list as well; an empty one means tokens are only
# looked up in the vocabulary, never merged.
tokenizer = Tokenizer(BPE(vocab=vocab, merges=[]))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

encoded = tokenizer.encode("Note On_12,Position_16/16 Note on 45,Position_8/8")
encoded.ids  # [3, 5]
The vocabulary creation I wrote seems a bit off compared to what you described but it should be close enough.
I assumed the number of possible values is relatively small (< 50 000): 128 notes × 16 tempos × 16 values makes about 32k possible combinations, being conservative. If there's actually an order of magnitude more, then this approach won't work anymore.
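Since you already have a dictionary mapping every possible event string to an id, another option is to plug that mapping straight into a WordLevel model instead of BPE, so the ids you already assigned are preserved. A minimal sketch; the vocab entries, the [UNK] token, and the comma-based splitting are assumptions about your data:

from tokenizers import Tokenizer, Regex, pre_tokenizers
from tokenizers.models import WordLevel

# Hypothetical excerpt of an existing numericalized dictionary;
# use the real mapping here.
vocab = {
    "[UNK]": 0,
    "Note On_60": 1,
    "Tempo Value_120": 2,
    "Position_4/16": 3,
}

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
# Split on the commas between events so "Note On_60" stays one token
# even though it contains a space.
tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern=Regex(r",\s*"), behavior="removed")

encoded = tokenizer.encode("Note On_60, Position_4/16, Tempo Value_120")
print(encoded.tokens)  # ['Note On_60', 'Position_4/16', 'Tempo Value_120']
print(encoded.ids)     # [1, 3, 2]

WordLevel only looks tokens up in the dictionary, so anything the pre-tokenizer produces that isn't in your mapping just becomes [UNK].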