Tokenizing Domain-Specific Text

Hello everyone, I've been referencing this paper on training transformer-based models using metadata-enhanced MIDI, and I was thinking about implementing it with the Hugging Face transformers and tokenizers libraries as an introduction to these libraries beyond the basic language-modeling examples. As I've been researching and referencing this tutorial, I've run into issues with tokenization. When training a tokenizer, how can I set up "word level" semantics? Each "word" in this case should be the full data within a string like 'Event(name=Position, time=360, value=4/16, text=360)', rather than words and characters delimited on spaces, which is what it's doing now, as listed below:
#version: 0.2 - Trained by huggingface/tokenizers
m e
a l
u e
al ue
i me
n ame
v alue
Ġ value
Ġt ex
Ġt ime
Ev en
Ġtex t
Even t
) ,
Ġ Event
N o

Apologies if these questions are noobish; I'm grokking a lot of this as I go along. Any help is greatly appreciated.


pinging @anthony and @Narsil here :slight_smile:


What is your original data like?

The "word level" semantics is usually handled by the pre-tokenizer logic (which basically splits up the data where it's relevant). In your case, it would depend on your original data. There is more info in the docs.
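For instance, here is a rough sketch of a custom pre-tokenizer. It assumes your corpus is a flat string of Event(...) entries like the one you quoted, so the regex below is an assumption about your format rather than something I can confirm. A Split pre-tokenizer with invert=True keeps each full Event(...) match as a single "word" and drops the text in between:

from tokenizers import Regex, pre_tokenizers

# Assumed format: Event(...) entries separated by whitespace.
event_pattern = Regex(r"Event\([^)]*\)")
pre_tok = pre_tokenizers.Split(event_pattern, behavior="removed", invert=True)

pieces = pre_tok.pre_tokenize_str(
    "Event(name=Position, time=360, value=4/16, text=360) "
    "Event(name=Note On, time=360, value=60, text=60)"
)
# Each Event(...) string comes back as one piece, together with its offsets.
print(pieces)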

If you have a specific example of what data comes in, and what you expect as a return, it's probably going to be easier to give you pointers.

Cheers,
Nicolas

Hello Narsil, my data looks something like this: Note On_XX, Tempo Value_XX, Chord_Note:type, Position_X/X, where XX can be a number from 0 to 127 and X can be a number from 1 to 16. I also have a dictionary with every possible value for the above text types numericalized. Will I have to write a custom tokenizer in order to accomplish this tokenization? Is there a way to use my existing numericalized dictionary in a tokenizer? Thank you for your help and for the documentation; I've begun going over it.

Yes, it seems your vocabulary is well defined and sufficiently small (128 * 16 * 16) to fit in a "WordLevel" tokenizer.

You can even create your vocabulary manually, which makes it easier to get running.

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel

# Enumerate every possible "word" up front (note/position combinations here).
vocab = {"[UNK]": 0}
i = 1
for value in range(0, 128):
    for position in range(0, 16):
        word = f"Note_On_{value},Position_{position}"
        vocab[word] = i
        i += 1

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
# Split on whitespace only, so each event string stays a single "word".
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

encoded = tokenizer.encode("Note_On_12,Position_15 Note_On_45,Position_8")
encoded.ids  # [208, 729]

The vocabulary creation I wrote seems a bit off compared to what you described, but it should be close enough.

I assumed the number of possible values is relatively small (< 50,000): 128 notes × 16 tempos × 16 values makes about 32k possible combinations, being conservative. If there's actually an order of magnitude more, then this approach won't work anymore.
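And about your existing numericalized dictionary: you shouldn't need to rebuild it, you can hand it to the model directly. A sketch, assuming it's a plain Python dict mapping each event string to an integer id, and assuming events are comma-separated in your text (adjust the split pattern to whatever your real delimiter is):

from tokenizers import Tokenizer, Regex, pre_tokenizers
from tokenizers.models import WordLevel

# Assumption: `my_vocab` stands in for your existing numericalized dictionary.
my_vocab = {"Note On_60": 0, "Tempo Value_120": 1, "Position_4/16": 2}
my_vocab["[UNK]"] = len(my_vocab)  # WordLevel needs the unk token in the vocab

tokenizer = Tokenizer(WordLevel(vocab=my_vocab, unk_token="[UNK]"))

# The pre-tokenizer has to cut the input into pieces that match the keys
# exactly; since your events contain spaces, split on the comma delimiter.
tokenizer.pre_tokenizer = pre_tokenizers.Split(Regex(r",\s*"), behavior="removed")

encoded = tokenizer.encode("Note On_60, Position_4/16, Tempo Value_99")
print(encoded.tokens)  # ['Note On_60', 'Position_4/16', '[UNK]']
print(encoded.ids)     # [0, 2, 3]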

Does that help?


Thank you, this is incredibly helpful.