I am thinking of pretraining a GPT2 model from scratch where the tokens must be whole words (e.g. using the WordLevel tokenizer) instead of subwords (e.g. using the ByteLevelBPETokenizer), because subwords don't make sense for my application. Will GPT2LMHeadModel and GPT2Tokenizer be able to accept word-level tokenization?
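For context, the model side of what I have in mind is roughly this (just a sketch; the vocab_size of 20000 is a placeholder for whatever my word-level vocabulary ends up being):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# placeholder vocab_size; in practice it would match the word-level vocabulary
config = GPT2Config(vocab_size=20000)
model = GPT2LMHeadModel(config)
```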
My concern is that a Tokenizer(WordLevel()) tokenizer saves to a single JSON file:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# train a word-level vocabulary and save it as one JSON file
tokenizer = Tokenizer(WordLevel())
tokenizer.train(files=["words.txt"])
tokenizer.save("wordlevel.json")
```
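As far as I understand, everything about that tokenizer (model type, vocabulary, settings) round-trips through that one file, e.g.:

```python
from tokenizers import Tokenizer

# reload the word-level tokenizer from the single JSON file
wordlevel = Tokenizer.from_file("wordlevel.json")
print(wordlevel.get_vocab_size())
```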
GPT2Tokenizer, on the other hand, reads from two files and is based on byte-level Byte-Pair-Encoding:

```python
from transformers import GPT2Tokenizer

# GPT2Tokenizer needs both the BPE vocabulary and the merge rules
tokenizer = GPT2Tokenizer(
    vocab_file="vocab.json",
    merges_file="merges.txt",
)
```
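What I am hoping is possible is to wrap that single JSON file for use with transformers instead of going through GPT2Tokenizer, something like the following (this is a guess on my part, not something I have verified for GPT2 training):

```python
from transformers import PreTrainedTokenizerFast

# unverified idea: load the word-level tokenizer JSON as a fast tokenizer
# and use it in place of GPT2Tokenizer when feeding the model
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="wordlevel.json")
print(hf_tokenizer("hello world"))
```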
If byte-level byte-pair encoding must be used, can ByteLevelBPETokenizer be configured to do word-level tokenization?
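For reference, this is the byte-level BPE training flow that produces the vocab.json / merges.txt pair above, as far as I understand it (the vocab_size and min_frequency values here are just placeholders):

```python
from tokenizers import ByteLevelBPETokenizer

# train a byte-level BPE tokenizer and write out vocab.json and merges.txt
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["words.txt"], vocab_size=50257, min_frequency=2)
tokenizer.save_model(".")
```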
Thanks