WordLevel Tokenization with GPT2?

I am thinking of pretraining a GPT2 model from scratch where the tokens must be whole words (e.g. using the WordLevel tokenizer) rather than subwords (e.g. using the ByteLevelBPETokenizer), because subwords don’t make sense for my application.

Will the GPT2LMHeadModel and GPT2Tokenizer be able to accept word level tokenization?

Because a Tokenizer(WordLevel()) tokenizer saves to a single JSON file:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on words, not whole lines
tokenizer.train(files=["words.txt"], trainer=WordLevelTrainer(special_tokens=["[UNK]"]))
tokenizer.save("wordlevel.json")
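For context, this is the behaviour I'm after. A toy word-level tokenizer (trained in memory here, on a made-up two-sentence corpus) only ever emits whole words, with out-of-vocabulary words mapped to an unknown token:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Toy corpus, purely for illustration
corpus = ["the cat sat", "the dog ran"]

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(corpus, trainer=WordLevelTrainer(special_tokens=["[UNK]"]))

print(tokenizer.encode("the cat ran").tokens)    # whole words only, no subword pieces
print(tokenizer.encode("the zebra sat").tokens)  # OOV word becomes [UNK]
```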

but the GPT2Tokenizer loads from two files (a vocabulary and a merges file) and is based on byte-level Byte-Pair Encoding:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer(
    vocab_file="vocab.json",
    merges_file="merges.txt",
)
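One route I'm wondering about (just a sketch, I haven't verified this is the intended approach): skip GPT2Tokenizer entirely and wrap the word-level tokenizer in PreTrainedTokenizerFast, on the assumption that GPT2LMHeadModel only ever consumes input IDs and doesn't care which tokenizer produced them. The corpus here is made up:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

# Toy word-level tokenizer (stand-in for one trained on real data)
tok = Tokenizer(WordLevel(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.train_from_iterator(["the cat sat"], trainer=WordLevelTrainer(special_tokens=["[UNK]"]))

# Wrap it so it exposes the usual transformers tokenizer interface;
# the model itself would only see the resulting input_ids.
hf_tok = PreTrainedTokenizerFast(tokenizer_object=tok, unk_token="[UNK]")
ids = hf_tok("the cat sat")["input_ids"]
print(hf_tok.convert_ids_to_tokens(ids))
```

Would pairing a tokenizer like this with GPT2LMHeadModel work, or does the model/config assume a BPE vocabulary somewhere?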

If byte-level byte-pair encoding must be used, can ByteLevelBPETokenizer be configured to do word-level tokenization?

Thanks :slight_smile: