I want to create a tokenizer that, instead of learning the tokens, takes a list of predefined tokens (in my case DNA dimers: "AT", "AG", "AC", and so on). This works with the BertWordPieceTokenizer:
```python
from tokenizers import BertWordPieceTokenizer

# Initialize an empty BERT tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=False,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False,
    unk_token="[UNK]",
    sep_token="[SEP]",
    cls_token="[CLS]",
    mask_token="[MASK]",
    pad_token="[PAD]",
)
tokenizer.add_tokens(["AA", "AT", "AG", "AC", "TA", "TT", "TG", "TC",
                      "GA", "GT", "GG", "GC", "CA", "CT", "CC", "CG"])
tokenizer.add_special_tokens(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
```
However, saving this does not work: tokenizer.save_model(…) creates empty files, and while tokenizer.save() does create a file, it cannot be reloaded, failing with

    TypeError: sep_token not found in the vocabulary
Is there a better-suited tokenizer class, or another way of saving and loading the tokenizer?
tokenizers version 0.10.1
transformers version 4.16.2
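For context, here is a sketch of an alternative direction I have been considering (not verified against these exact versions; the file name `tokenizer.json` is arbitrary): since the vocabulary is fixed up front, a plain `Tokenizer` with a `WordLevel` model can be built directly from a dict, and that object saves to and reloads from a single JSON file.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Fixed vocabulary: special tokens first, then the 16 DNA dimers.
specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
dimers = ["AA", "AT", "AG", "AC", "TA", "TT", "TG", "TC",
          "GA", "GT", "GG", "GC", "CA", "CT", "CC", "CG"]
vocab = {tok: i for i, tok in enumerate(specials + dimers)}

# WordLevel maps each whitespace-separated token directly to its id,
# falling back to [UNK] for anything outside the vocabulary.
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Round-trip: save to a single JSON file and reload it.
tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")

print(reloaded.encode("AT GC TT").tokens)
```

This assumes the input sequence is already split into dimers separated by whitespace; I have not checked whether it interacts cleanly with transformers 4.16.2.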