Marxav
December 2, 2021, 9:51am
Hi,
I would like to use a character-level tokenizer to implement a use case similar to minGPT's play_char, which could then be shared on the Hugging Face Hub.
My question is: is there an existing HF char-level tokenizer that can be used together with a HF autoregressive model (a.k.a. GPT-like model)?
Thanks!
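For context, minGPT's play_char builds its character vocabulary directly from the training text rather than using a pretrained tokenizer. A minimal stdlib-only sketch of that idea (the class name and details here are illustrative, not an existing HF API):

```python
class CharTokenizer:
    """Minimal character-level tokenizer in the spirit of minGPT's play_char."""

    def __init__(self, text):
        self.chars = sorted(set(text))  # vocabulary = unique characters in the corpus
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}
        self.itos = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, s):
        return [self.stoi[c] for c in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
print(tok.decode(ids))  # round-trips back to "hello"
```

The question then is whether something equivalent already exists in the transformers library, compatible with its GPT-style models.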
nielsr
December 2, 2021, 10:22am
Hi,
We do have character-level tokenizers in the library, but those are not for decoder-only models.
Current character-based tokenizers include CANINE (character-level, encoder-only) and ByT5 (byte-level, encoder-decoder).
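CANINE, for example, skips a learned vocabulary entirely and maps each character to its Unicode code point. A rough stdlib-only sketch of that scheme (special tokens such as [CLS]/[SEP] are omitted here for simplicity; function names are illustrative):

```python
def canine_style_encode(text):
    # CANINE-style ids: each character becomes its Unicode code point
    return [ord(c) for c in text]

def canine_style_decode(ids):
    # Inverse mapping: code points back to characters
    return "".join(chr(i) for i in ids)

print(canine_style_encode("abc"))  # [97, 98, 99]
```

Because the mapping is fixed, no vocabulary file needs to be trained or stored, which is part of why these tokenizers are tied to specific model architectures rather than being drop-in replacements for GPT-2's BPE tokenizer.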
Marxav
March 19, 2022, 11:37am
In order to have a Hugging Face equivalent to minGPT, I ended up using a GPT-2 model together with a retrained tokenizer.
I forced the gpt2 tokenizer to use single-character tokens with some code like:
chars = ['a', 'b', 'c', ...]
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=len(chars), initial_alphabet=chars)
The retrained tokenizer still contains extra tokens beyond those I wanted in the initial_alphabet, but the GPT-2 model performs reasonably well at the character level.
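A batch_iterator like the one passed to train_new_from_iterator can be as simple as the following (hypothetical helper, assuming corpus is a list of strings):

```python
def batch_iterator(corpus, batch_size=1000):
    # Yield successive slices of the corpus; train_new_from_iterator
    # consumes these batches to build the new vocabulary.
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]
```

Feeding the corpus in batches rather than all at once keeps memory use bounded during tokenizer training.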
In case anyone is looking for a character tokenizer for Hugging Face Transformers, you can check my repo.