I want to create a tokenizer that, instead of learning the tokens, takes a list of predefined tokens (in my case DNA dimers: "AT", "AG", "AC", and so on). This works with the BertWordPieceTokenizer:
```python
from tokenizers import BertWordPieceTokenizer

# Initialize an empty BERT tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=False,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False,
    unk_token="[UNK]",
    sep_token="[SEP]",
    cls_token="[CLS]",
    mask_token="[MASK]",
    pad_token="[PAD]",
)
tokenizer.add_tokens(["AA", "AT", "AG", "AC", "TA", "TT", "TG", "TC",
                      "GA", "GT", "GG", "GC", "CA", "CT", "CC", "CG"])
tokenizer.add_special_tokens(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
```
However, saving this does not work: tokenizer.save_model(…) creates empty files, and while tokenizer.save() does create a file, it cannot be reloaded, failing with

    TypeError: sep_token not found in the vocabulary
Is there a better-suited tokenizer class, or another way of saving and loading the tokenizer?
tokenizers version 0.10.1
transformers version 4.16.2
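For context, here is a sketch of an alternative direction I have been considering (not verified against these exact versions; the file name `tokenizer.json` is arbitrary): since the vocabulary is fixed up front, a plain `Tokenizer` with a `WordLevel` model can be built directly from a dict, and that object saves to and reloads from a single JSON file.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Fixed vocabulary: special tokens first, then the 16 DNA dimers.
specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
dimers = ["AA", "AT", "AG", "AC", "TA", "TT", "TG", "TC",
          "GA", "GT", "GG", "GC", "CA", "CT", "CC", "CG"]
vocab = {tok: i for i, tok in enumerate(specials + dimers)}

# WordLevel maps each whitespace-separated token directly to its id,
# falling back to [UNK] for anything outside the vocabulary.
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Round-trip: save to a single JSON file and reload it.
tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")

print(reloaded.encode("AT GC TT").tokens)
```

This assumes the input sequence is already split into dimers separated by whitespace; I have not checked whether it interacts cleanly with transformers 4.16.2.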