Hello everyone, I've been referencing this paper on training transformer-based models on metadata-enhanced MIDI, and I was thinking about implementing it with the Hugging Face transformers and tokenizers libraries as an introduction to these libraries beyond the basic language modeling examples. While researching and following this tutorial, I've run into issues with tokenization. When training a tokenizer, how can I set up "word-level" semantics? Technically, each "word" in this case would be an entire string such as "Event(name=Position, time=360, value=4/16, text=360)", rather than words and characters delimited on spaces, which is what the tokenizer is doing now. Here are the merges it currently learns:
#version: 0.2 - Trained by huggingface/tokenizers
m e
a l
u e
al ue
i me
n ame
v alue
Ġ value
Ġt ex
Ġt ime
Ev en
Ġtex t
Even t
) ,
Ġ Event
N o
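To make the goal concrete, here is a minimal sketch of the behavior I'm after, not a confirmed solution. It assumes the tokenizers library's `WordLevel` model with a `Split` pre-tokenizer, and it assumes I first rewrite my data so there is exactly one event per line. The corpus and the `Bar` events are made up for illustration; only the `Position` event comes from my real data.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split
from tokenizers.trainers import WordLevelTrainer

# Hypothetical corpus: one string per piece, one event per line.
corpus = [
    "Event(name=Bar, time=0, value=None, text=1)\n"
    "Event(name=Position, time=360, value=4/16, text=360)",
    "Event(name=Bar, time=0, value=None, text=1)\n"
    "Event(name=Position, time=0, value=1/16, text=0)",
]

# WordLevel maps each whole pre-token to a single id; it learns no merges.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))

# Assumption: newline is the event delimiter. Splitting on it (and dropping it)
# keeps the spaces inside each event, so every event stays in one piece.
tokenizer.pre_tokenizer = Split("\n", behavior="removed")

trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode(corpus[0])
print(encoding.tokens)  # one token per event line
print(encoding.ids)
```

With this setup the vocabulary is just the set of distinct event strings, so any event not seen during training would map to `[UNK]`. That may or may not be acceptable depending on how many distinct time/value combinations the data contains.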
Apologies if these questions are noobish; I'm grokking a lot of this as I go along. Any help is greatly appreciated.