Character-level tokenizer

Hi,

I would like to use a character-level tokenizer to implement a use case similar to minGPT's play_char that could be shared on the Hugging Face Hub.

My question is: is there an existing HF char-level tokenizer that can be used together with an HF autoregressive model (i.e., a GPT-like model)?

Thanks!


Hi,

We do have character-level tokenizers in the library, but those are not for decoder-only models.

Current character-based tokenizers include:

  • CANINE
  • ByT5 (byte-level)
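For reference, a minimal sketch of loading both through AutoTokenizer (using the standard public checkpoints google/canine-s and google/byt5-small):

```python
from transformers import AutoTokenizer

# CANINE maps each character to its Unicode code point
canine_tok = AutoTokenizer.from_pretrained("google/canine-s")
print(canine_tok("hello")["input_ids"])  # one id per character, plus [CLS]/[SEP]

# ByT5 maps each UTF-8 byte to an id
byt5_tok = AutoTokenizer.from_pretrained("google/byt5-small")
print(byt5_tok("hello")["input_ids"])  # one id per byte, plus the </s> marker
```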


In order to have a Hugging Face equivalent to minGPT, I ended up using:

  • gpt2 (decoder-only)

I forced the gpt2 tokenizer to use single-character tokens with some code like the following (full runnable sketch below):

  • chars = ['a', 'b', 'c', …]
  • new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=len(chars), initial_alphabet=chars)

The retrained tokenizer still contains extra tokens beyond those I listed in the initial_alphabet (gpt2's byte-level BPE keeps its 256 base byte tokens, so the vocabulary cannot shrink below that), but the gpt2 model performs reasonably well at char level.
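Putting it together, a minimal runnable sketch of this approach; the corpus and batch_iterator here are hypothetical placeholders:

```python
from transformers import AutoTokenizer

# start from the pretrained gpt2 tokenizer (byte-level BPE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

chars = ['a', 'b', 'c']  # hypothetical: extend to your full alphabet

# hypothetical corpus; replace with your own training text
corpus = ["abc abc cab", "bca acb bac"]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# retrain the BPE vocabulary; with a target vocab_size below the
# byte-level base alphabet, essentially no merges are learned,
# so tokens stay (close to) single characters
new_tokenizer = tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=len(chars), initial_alphabet=chars
)
print(new_tokenizer.tokenize("abc cab"))
```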

In case anyone is looking for a character tokenizer for Hugging Face Transformers, you can check my repo.


Hi again everyone,

I am working with bank transactions, and some examples of the merchant names are
"Tesco Super", "Tesco Superma", "Paypal * Tescosupermar", and "Zilch * tescosupermarket - 343", which all have very similar meanings.

I am currently using a sentence embedding based on the average of word embeddings (universal sentence embedding) to classify these transactions into supermarkets, hotels, restaurants, etc. It works OK, but it could be better, so I want to create a sentence embedding based on the average of character embeddings.

Is there a language model that was trained on character tokens and from which I can extract an embedding that combines the meaning of characters? I know there are some character tokenizers, but which models are compatible with such a tokenizer?

I know there are CharBERT and CharacterBERT, but those still use a word tokenizer; the meaning of each word is just built from characters. For me, the space " " is just another character and does not always mean a new word, so I want my embedding to treat the space as just another character with a meaning; * is also a character with an important meaning.

Thanks a lot!

Randomly stumbled on this as I'm looking for a character-level encoding model myself. However, for your case, I'd suggest trying a model from the MTEB leaderboard that does well on classification and is smaller (and hopefully faster). See whether the embeddings for your examples actually come out dissimilar. If they do, you can fine-tune the model by feeding it triples of (query, positive_example, random_negative_example); the query and positive examples would be like the ones you have. Look up the triplet loss function in PyTorch for details (see the sketch below). Good luck!
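For reference, a minimal PyTorch sketch of the triplet setup; the tensors here are random stand-ins for your encoder's output:

```python
import torch
import torch.nn as nn

# stand-ins for encoder outputs; in practice these come from your
# (trainable) sentence/character encoder
anchor = torch.randn(8, 256, requires_grad=True)    # e.g. "Tesco Super"
positive = torch.randn(8, 256, requires_grad=True)  # e.g. "Paypal * Tescosupermar"
negative = torch.randn(8, 256, requires_grad=True)  # e.g. a random other merchant

# pulls anchor/positive together, pushes anchor/negative apart
triplet_loss = nn.TripletMarginLoss(margin=1.0)
loss = triplet_loss(anchor, positive, negative)
loss.backward()  # gradients flow back into the encoder during fine-tuning
```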

Have you seen CANINE and ByT5?
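Both operate directly on characters/bytes, so spaces and * get their own embeddings. A minimal sketch of mean-pooled character embeddings with CANINE (using the google/canine-s checkpoint):

```python
import torch
from transformers import CanineModel, CanineTokenizer

# CANINE works directly on Unicode code points: no word segmentation,
# so " " and "*" are ordinary characters with their own embeddings
tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineModel.from_pretrained("google/canine-s")

texts = ["Tesco Super", "Zilch * tescosupermarket - 343"]
inputs = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# mean-pool the per-character hidden states into one vector per string,
# ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # (2, 768) for canine-s
```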