Character-level tokenizer


I would like to use a character-level tokenizer to implement a use case similar to minGPT's play_char that could be shared on the Hugging Face Hub.

My question is: is there an existing HF character-level tokenizer that can be used together with a HF autoregressive (i.e. GPT-like) model?
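For reference, minGPT's play_char setup boils down to a tiny stand-alone character tokenizer. A minimal sketch in plain Python (not a Hugging Face API, just an illustration of the idea) looks like:

```python
class CharTokenizer:
    """Map each distinct character in a corpus to an integer id, minGPT-style."""

    def __init__(self, text):
        self.chars = sorted(set(text))  # vocabulary: one entry per distinct character
        self.stoi = {c: i for i, c in enumerate(self.chars)}
        self.itos = {i: c for c, i in self.stoi.items()}

    def encode(self, s):
        return [self.stoi[c] for c in s]  # raises KeyError on unseen characters

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)
```

Encoding and decoding round-trip exactly, as long as every character of the input appeared in the corpus the tokenizer was built from.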



We do have character-based tokenizers in the library, but those are not for decoder-only models.


In order to have a Hugging Face equivalent to minGPT, I ended up using:

  • gpt2 (decoder-only)

I forced the gpt2 tokenizer to use single-character tokens with code like:

```python
chars = ['a', 'b', 'c', ...]
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=len(chars), initial_alphabet=chars)
```

The retrained tokenizer still contains extra tokens beyond those I wanted in the initial_alphabet, but the gpt2 model performs reasonably well at the character level.
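As a variant that avoids retraining from gpt2 altogether, a character-level tokenizer can be assembled directly with the `tokenizers` library and wrapped for use with transformers. This is only a sketch under stated assumptions (a fixed lowercase-plus-space alphabet, no unknown-character handling); `chars` and the resulting ids are illustrative:

```python
import string

from tokenizers import Tokenizer, models
from transformers import PreTrainedTokenizerFast

# Illustrative fixed alphabet; a real use case would derive it from the corpus.
chars = list(string.ascii_lowercase) + [" "]
vocab = {c: i for i, c in enumerate(chars)}

# A BPE model with an empty merges list never merges anything,
# so every emitted token is a single character from the vocab.
char_tokenizer = Tokenizer(models.BPE(vocab=vocab, merges=[]))

# Wrap it so it exposes the usual transformers tokenizer interface.
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=char_tokenizer)
```

With this wrapper, `hf_tokenizer.encode("abc")` yields one id per character, and the tokenizer can be passed to transformers training code like any other fast tokenizer.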

In case anyone is looking for a character tokenizer for Hugging Face Transformers, you can check my repo.
