I would like to use a character-level tokenizer to implement a use case similar to minGPT's play_char, with a model that could be shared on the Hugging Face Hub.
My question is: is there an existing HF char-level tokenizer that can be used together with a HF autoregressive model (a.k.a. GPT-like model)?
We do have character-level tokenizers in the library, but those are not for decoder-only models.
Current character-based tokenizers include CANINE (character-level) and ByT5 (byte-level).
In order to have a Hugging Face equivalent to minGPT, I ended up using the GPT-2 tokenizer and model.
I forced the gpt2 tokenizer to use single-character tokens with code like:
- chars = ['a', 'b', 'c', …]
- new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=len(chars), initial_alphabet=chars)
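Put together, the retraining step above can be sketched as follows; the tiny in-memory corpus, the `batch_iterator` helper, and the alphabet are illustrative assumptions, not part of the original post:

```python
# Sketch (assumed setup): shrink the pretrained GPT-2 tokenizer toward a
# character-level alphabet via train_new_from_iterator.
from transformers import AutoTokenizer

# Illustrative corpus; in practice, iterate over your own dataset.
corpus = ["hello world", "character level modelling"]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

chars = list("abcdefghijklmnopqrstuvwxyz ")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Extra keyword arguments such as initial_alphabet are forwarded to the
# underlying BPE trainer.
new_tokenizer = tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=len(chars),
    initial_alphabet=chars,
)
print(new_tokenizer.tokenize("hello world"))
```

Because GPT-2 uses byte-level BPE, the retrained vocabulary keeps its full base byte alphabet, so it ends up larger than `len(chars)`.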
The retrained tokenizer still contains extra tokens beyond those I passed in initial_alphabet (GPT-2's byte-level BPE keeps its base byte alphabet), but the gpt2 model performs reasonably well at character level.
In case anyone is looking for a character tokenizer for Hugging Face Transformers, you can check my repo.
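Alternatively, a character-level tokenizer can be assembled directly with the `tokenizers` library. The sketch below (a WordLevel model over an assumed toy alphabet, unrelated to the repo above) shows the idea:

```python
# Sketch: a character-level tokenizer built from a WordLevel model whose
# vocabulary entries are individual characters. The alphabet is an assumption.
from tokenizers import Regex, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split

alphabet = "abcdefghijklmnopqrstuvwxyz "
vocab = {ch: i for i, ch in enumerate(alphabet)}
vocab["[UNK]"] = len(vocab)

tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
# Split the input into isolated single characters before vocabulary lookup.
tokenizer.pre_tokenizer = Split(Regex("."), behavior="isolated")

enc = tokenizer.encode("hello world")
print(enc.tokens)  # one token per character
```

To use it with a GPT-style model from transformers, wrap it with `PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="[UNK]")`.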