Character-level tokenizer with a specific vocabulary order

Hello everyone,

I trained a character-level model using a custom tokenizer class (not from Hugging Face). Now I want to create an HF tokenizer so users can use it for inference. However, I only see two character-level tokenizers: Canine and ByT5. These tokenizers use "ord" to get the id, instead of a specific order from the vocabulary (e.g., using sort and enumerate). I know "ord" is more robust, but unfortunately I didn't train my model with this type of tokenization.
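
For concreteness, here is a small illustration of the two indexing schemes (the alphabet below is just an example, not my actual vocabulary):

    # ord-based ids, roughly what ByT5/Canine-style tokenizers do
    ord_ids = [ord(c) for c in "ACD"]       # [65, 67, 68]

    # vocabulary-order ids, as in my tokenizer below
    alphabet = sorted("DCAE")               # explicit, fixed order: ['A', 'C', 'D', 'E']
    a_to_i = {c: i for i, c in enumerate(alphabet)}
    order_ids = [a_to_i[c] for c in "ACD"]  # [0, 1, 2]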

I know I can create a tokenizer class inheriting from PreTrainedTokenizer, but how do I upload it to HF? Do I need to create a PR? I don't want to force users to install my library to get the tokenizer.
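
Something like this is what I have in mind (just a rough sketch; the alphabet string, special tokens, and resulting ids are placeholders and would need to match what my model was actually trained with):

    from transformers import PreTrainedTokenizer

    class CharTokenizer(PreTrainedTokenizer):
        """Character-level tokenizer that keeps an explicit vocabulary order."""

        def __init__(self, alphabet="ACDEFGHIKLMNPQRSTVWY-", pad_token="<pad>", mask_token="<mask>", **kwargs):
            # Build the mappings before calling super().__init__, which may query the vocab
            self._a_to_i = {c: i for i, c in enumerate(alphabet)}
            self._i_to_a = {i: c for c, i in self._a_to_i.items()}
            super().__init__(pad_token=pad_token, mask_token=mask_token, **kwargs)

        @property
        def vocab_size(self):
            return len(self._a_to_i)

        def get_vocab(self):
            return dict(self._a_to_i)

        def _tokenize(self, text):
            return list(text)  # one token per character

        def _convert_token_to_id(self, token):
            return self._a_to_i[token]

        def _convert_id_to_token(self, index):
            return self._i_to_a[index]

        # save_vocabulary() would also need to be implemented so that
        # save_pretrained() / push_to_hub() can write the vocab file.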

For reference, my original tokenizer is quite simple:

import numpy as np
import torch

class Tokenizer(object):
    """Convert between strings and indices."""

    def __init__(self, protein_alphabet=MSA_ALPHABET, pad=MSA_PAD, mask=MASK, all_aas=MSA_AAS, gap=GAP,
                 start=START, stop=STOP, sep=SEP):
        self.alphabet = list("".join(protein_alphabet))
        self.all_aas = list("".join(all_aas))
        self.pad = pad
        self.mask = mask
        self.gap = gap
        self.start = start
        self.stop = stop
        self.sep = sep
        self.a_to_i = {u: i for i, u in enumerate(self.alphabet)}
        self.i_to_a = np.array(self.alphabet)

    def tokenize(self, seq):
        return np.array([self.a_to_i[a] for a in seq[0]])  # for nested lists

    def tokenizeMSA(self, seq):
        return np.array([self.a_to_i[a] for a in seq])  # not nested

    def untokenize(self, x):
        if torch.is_tensor(x):
            return "".join([self.i_to_a[int(t.item())] for t in x])
        else:
            return "".join([self.i_to_a[t] for t in x])

Thanks!


Hello. I found a guide for adding models (not tokenizers, but maybe a similar case). It seems the procedure is done on GitHub. It doesn't seem to be as difficult as I thought.

I see, thank you. So there is no simple way of reusing what's already available in the HF library?


@John6666 Do you think it's possible to upload the code to my HF repo and use trust_remote_code to load the tokenizer?


I see. It probably works. Unless you want to incorporate it as a new architecture into the Hugging Face library itself, it seems you can use it through the Auto classes by writing a tokenizer_config.json and setting trust_remote_code=True.

Loading the chiTra tokenizer with AutoTokenizer #transformers - Qiita (in Japanese)
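
As a rough sketch of that setup (the file, class, and repo names below are placeholders): upload the tokenizer class as a .py file in the model repo, point tokenizer_config.json at it via auto_map, and users can then load it without installing anything extra:

    # tokenizer_config.json in the model repo:
    # {
    #   "tokenizer_class": "CharTokenizer",
    #   "auto_map": {
    #     "AutoTokenizer": ["tokenization_char.CharTokenizer", null]
    #   }
    # }
    # with tokenization_char.py (containing CharTokenizer) uploaded alongside it.

    from transformers import AutoTokenizer

    # Loading then only requires transformers plus opting in to remote code:
    tokenizer = AutoTokenizer.from_pretrained("your-username/your-model", trust_remote_code=True)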

@John6666
I'm trying this, but it's failing with this error: Custom Tokenizer Error - Please Help!

Any idea?