Hello everyone,
I trained a character-level model using a custom tokenizer class (not from Hugging Face). Now I want to create an HF tokenizer so users can consume it for inference. However, I only see two character-level tokenizers: Canine and ByT5. These tokenizers use `ord` to get the id, instead of a specific order from the vocabulary (e.g., using sort and enumerate). I know `ord` is more robust, but unfortunately I didn't train my model with this type of tokenization.
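To make the mismatch concrete (a minimal sketch with a toy alphabet, just for illustration):

```python
# ByT5/Canine style: the id is derived from the character's code point
ord("A")  # 65, fixed by Unicode regardless of my vocabulary

# My training setup: the id is the character's position in a vocabulary
vocab = sorted(set("ACDEFGHIKLMNPQRSTVWY"))
a_to_i = {c: i for i, c in enumerate(vocab)}
a_to_i["A"]  # 0, depends entirely on the alphabet I trained with
```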
I know I can create a tokenizer class inheriting from `PreTrainedTokenizer`, but how do I upload it to HF? Do I need to create a PR? I don't want to force users to install my library just to get the tokenizer.
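For context, this is roughly the kind of subclass I have in mind (just a sketch; `CharTokenizer` and its default alphabet are placeholders, not something I've published):

```python
from transformers import PreTrainedTokenizer

class CharTokenizer(PreTrainedTokenizer):  # placeholder name
    def __init__(self, alphabet="ACDEFGHIKLMNPQRSTVWY-", **kwargs):
        # Ids follow the position in the alphabet, matching my training setup
        self._a_to_i = {c: i for i, c in enumerate(alphabet)}
        self._i_to_a = {i: c for c, i in self._a_to_i.items()}
        super().__init__(**kwargs)

    @property
    def vocab_size(self):
        return len(self._a_to_i)

    def get_vocab(self):
        return dict(self._a_to_i)

    def _tokenize(self, text):
        return list(text)  # character-level: one token per character

    def _convert_token_to_id(self, token):
        return self._a_to_i[token]

    def _convert_id_to_token(self, index):
        return self._i_to_a[index]
```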
For reference, my original tokenizer is quite simple:
```python
import numpy as np
import torch

# MSA_ALPHABET, MSA_PAD, MASK, MSA_AAS, GAP, START, STOP, and SEP are
# constants defined elsewhere in my library.

class Tokenizer(object):
    """Convert between strings and indices."""

    def __init__(self, protein_alphabet=MSA_ALPHABET, pad=MSA_PAD, mask=MASK, all_aas=MSA_AAS,
                 gap=GAP, start=START, stop=STOP, sep=SEP):
        self.alphabet = list("".join(protein_alphabet))
        self.all_aas = list("".join(all_aas))
        self.pad = pad
        self.mask = mask
        self.gap = gap
        self.start = start
        self.stop = stop
        self.sep = sep
        # Token ids come from each character's position in the alphabet
        self.a_to_i = {u: i for i, u in enumerate(self.alphabet)}
        self.i_to_a = np.array(self.alphabet)

    def tokenize(self, seq):
        return np.array([self.a_to_i[a] for a in seq[0]])  # for nested lists

    def tokenizeMSA(self, seq):
        return np.array([self.a_to_i[a] for a in seq])  # not nested

    def untokenize(self, x):
        if torch.is_tensor(x):
            return "".join([self.i_to_a[int(t.item())] for t in x])
        else:
            return "".join([self.i_to_a[t] for t in x])
```
Thanks!