Hello everyone,
I trained a character-level model using a custom tokenizer class (not from Hugging Face). Now I want to create an HF tokenizer so users can consume it for inference. However, I only see two character-level tokenizers: Canine and ByT5. These tokenizers use `ord` to get the id, instead of a specific order from the vocabulary (e.g., using sort and enumerate). I know `ord` is more robust, but unfortunately I didn't train my model with this type of tokenization.
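To make the mismatch concrete (a minimal sketch with a toy alphabet, just for illustration):

```python
# ByT5/Canine style: the id is derived from the character's code point
ord("A")  # 65, fixed by Unicode regardless of my vocabulary

# My training setup: the id is the character's position in a vocabulary
vocab = sorted(set("ACDEFGHIKLMNPQRSTVWY"))
a_to_i = {c: i for i, c in enumerate(vocab)}
a_to_i["A"]  # 0, depends entirely on the alphabet I trained with
```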
I know I can create a tokenizer class inheriting from `PreTrainedTokenizer`, but how do I upload it to HF? Do I need to create a PR? I don't want to force users to install my library just to get the tokenizer.
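For context, this is roughly the kind of subclass I have in mind (just a sketch; `CharTokenizer` and its default alphabet are placeholders, not something I've published):

```python
from transformers import PreTrainedTokenizer

class CharTokenizer(PreTrainedTokenizer):  # placeholder name
    def __init__(self, alphabet="ACDEFGHIKLMNPQRSTVWY-", **kwargs):
        # Ids follow the position in the alphabet, matching my training setup
        self._a_to_i = {c: i for i, c in enumerate(alphabet)}
        self._i_to_a = {i: c for c, i in self._a_to_i.items()}
        super().__init__(**kwargs)

    @property
    def vocab_size(self):
        return len(self._a_to_i)

    def get_vocab(self):
        return dict(self._a_to_i)

    def _tokenize(self, text):
        return list(text)  # character-level: one token per character

    def _convert_token_to_id(self, token):
        return self._a_to_i[token]

    def _convert_id_to_token(self, index):
        return self._i_to_a[index]
```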
For reference, my original tokenizer is quite simple:
```python
import numpy as np
import torch

# MSA_ALPHABET, MSA_PAD, MASK, MSA_AAS, GAP, START, STOP, and SEP are
# constants defined elsewhere in my library.

class Tokenizer(object):
    """Convert between strings and indices."""

    def __init__(self, protein_alphabet=MSA_ALPHABET, pad=MSA_PAD, mask=MASK, all_aas=MSA_AAS,
                 gap=GAP, start=START, stop=STOP, sep=SEP):
        self.alphabet = list("".join(protein_alphabet))
        self.all_aas = list("".join(all_aas))
        self.pad = pad
        self.mask = mask
        self.gap = gap
        self.start = start
        self.stop = stop
        self.sep = sep
        # Token ids come from each character's position in the alphabet
        self.a_to_i = {u: i for i, u in enumerate(self.alphabet)}
        self.i_to_a = np.array(self.alphabet)

    def tokenize(self, seq):
        return np.array([self.a_to_i[a] for a in seq[0]])  # for nested lists

    def tokenizeMSA(self, seq):
        return np.array([self.a_to_i[a] for a in seq])  # not nested

    def untokenize(self, x):
        if torch.is_tensor(x):
            return "".join([self.i_to_a[int(t.item())] for t in x])
        else:
            return "".join([self.i_to_a[t] for t in x])
```
Thanks!