Character-level tokenizer

Hi,

I would like to use a character-level tokenizer to implement a use case similar to minGPT's play_char that could be shared on the Hugging Face Hub.

My question is: is there an existing HF char-level tokenizer that can be used together with an HF autoregressive model (i.e., a GPT-like model)?

Thanks!


Hi,

We do have character-level tokenizers in the library, but those are not for decoder-only models.

Current character-based tokenizers include:

  • CANINE
  • ByT5 (byte-level)
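For reference, a minimal sketch of loading both through AutoTokenizer (using the standard public checkpoints google/canine-s and google/byt5-small):

```python
from transformers import AutoTokenizer

# CANINE maps each character to its Unicode code point
canine_tok = AutoTokenizer.from_pretrained("google/canine-s")
print(canine_tok("hello")["input_ids"])  # one id per character, plus [CLS]/[SEP]

# ByT5 maps each UTF-8 byte to an id
byt5_tok = AutoTokenizer.from_pretrained("google/byt5-small")
print(byt5_tok("hello")["input_ids"])  # one id per byte, plus the </s> marker
```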


In order to have a Hugging Face equivalent to minGPT, I ended up using:

  • gpt2 (decoder-only)

I forced the gpt2 tokenizer to use single-character tokens with some code like the following (full runnable sketch below):

  • chars = ['a', 'b', 'c', …]
  • new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=len(chars), initial_alphabet=chars)

The retrained tokenizer still contains extra tokens beyond those I listed in the initial_alphabet (gpt2's byte-level BPE keeps its 256 base byte tokens, so the vocabulary cannot shrink below that), but the gpt2 model performs reasonably well at char level.
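Putting it together, a minimal runnable sketch of this approach; the corpus and batch_iterator here are hypothetical placeholders:

```python
from transformers import AutoTokenizer

# start from the pretrained gpt2 tokenizer (byte-level BPE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

chars = ['a', 'b', 'c']  # hypothetical: extend to your full alphabet

# hypothetical corpus; replace with your own training text
corpus = ["abc abc cab", "bca acb bac"]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# retrain the BPE vocabulary; with a target vocab_size below the
# byte-level base alphabet, essentially no merges are learned,
# so tokens stay (close to) single characters
new_tokenizer = tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=len(chars), initial_alphabet=chars
)
print(new_tokenizer.tokenize("abc cab"))
```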

In case anyone is looking for a character tokenizer for Hugging Face Transformers, you can check my repo.


Hi again everyone,

I am working with bank transactions, and some examples of the merchant names are
"Tesco Super", "Tesco Superma", "Paypal * Tescosupermar", and "Zilch * tescosupermarket - 343", which all have very similar meanings.

I am currently using a sentence embedding based on the average of word embeddings (universal sentence embedding) to classify these transactions into supermarkets, hotels, restaurants, etc. It works OK, but it could be better, so I want to create a sentence embedding based on the average of character embeddings.

Is there a language model that was trained on character tokens and from which I can extract an embedding that combines the meaning of characters? I know there are some character tokenizers, but which models are compatible with such a tokenizer?

I know there are CharBERT and CharacterBERT, but those still use a word tokenizer; the meaning of each word is just built from characters. For me, the space " " is just another character and does not always mean a new word, so I want my embedding to treat the space as just another character with a meaning; * is also a character with an important meaning.

Thanks a lot!

Randomly stumbled on this as I'm looking for a character-level encoding model myself. However, for your case, I'd suggest trying a model from the MTEB leaderboard that does well on classification and is smaller (and hopefully faster). See whether the embeddings for your examples actually come out dissimilar. If they do, you can fine-tune the model by feeding it triples of (query, positive_example, random_negative_example); the query and positive examples would be like the ones you have. Look up the triplet loss function in PyTorch for details (see the sketch below). Good luck!
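For reference, a minimal PyTorch sketch of the triplet setup; the tensors here are random stand-ins for your encoder's output:

```python
import torch
import torch.nn as nn

# stand-ins for encoder outputs; in practice these come from your
# (trainable) sentence/character encoder
anchor = torch.randn(8, 256, requires_grad=True)    # e.g. "Tesco Super"
positive = torch.randn(8, 256, requires_grad=True)  # e.g. "Paypal * Tescosupermar"
negative = torch.randn(8, 256, requires_grad=True)  # e.g. a random other merchant

# pulls anchor/positive together, pushes anchor/negative apart
triplet_loss = nn.TripletMarginLoss(margin=1.0)
loss = triplet_loss(anchor, positive, negative)
loss.backward()  # gradients flow back into the encoder during fine-tuning
```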

Have you seen CANINE and ByT5?
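Both operate directly on characters/bytes, so spaces and * get their own embeddings. A minimal sketch of mean-pooled character embeddings with CANINE (using the google/canine-s checkpoint):

```python
import torch
from transformers import CanineModel, CanineTokenizer

# CANINE works directly on Unicode code points: no word segmentation,
# so " " and "*" are ordinary characters with their own embeddings
tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineModel.from_pretrained("google/canine-s")

texts = ["Tesco Super", "Zilch * tescosupermarket - 343"]
inputs = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# mean-pool the per-character hidden states into one vector per string,
# ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # (2, 768) for canine-s
```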