Load tokenizer from file : Exception: data did not match any variant of untagged enum ModelWrapper

Hello! I am having an issue loading a BPE tokenizer with Tokenizer.from_file().
When I try, I get the error below. Line 11743 is one of the last lines of the file (the closing brace of the "model" block, as marked in the excerpt):
Exception: data did not match any variant of untagged enum ModelWrapper at line 11743 column 3
I have no idea what the problem is or how to solve it.
Does anyone have a clue?
I did not train this BPE directly, but the structure is the correct one: vocab and merges in a JSON file. I started from a BPE tokenizer that I had trained myself (and that was working), then completely replaced the vocab and the merges with ones I created manually (without proper training). I don't see the problem, since the structure should be the same as the original one.
My tokenizers version is 0.13.1.

{
  "version":"1.0",
  "truncation":null,
  "padding":null,
  "added_tokens":[
    {
      "id":0,
      "content":"[UNK]",
      "single_word":false,
      "lstrip":false,
      "rstrip":false,
      "normalized":false,
      "special":true
    },
    {
      "id":1,
      "content":"[CLS]",
      "single_word":false,
      "lstrip":false,
      "rstrip":false,
      "normalized":false,
      "special":true
    },
    {
      "id":2,
      "content":"[SEP]",
      "single_word":false,
      "lstrip":false,
      "rstrip":false,
      "normalized":false,
      "special":true
    },
    {
      "id":3,
      "content":"[PAD]",
      "single_word":false,
      "lstrip":false,
      "rstrip":false,
      "normalized":false,
      "special":true
    },
    {
      "id":4,
      "content":"[MASK]",
      "single_word":false,
      "lstrip":false,
      "rstrip":false,
      "normalized":false,
      "special":true
    }
  ],
  "normalizer":null,
  "pre_tokenizer":{
    "type":"Whitespace"
  },
  "post_processor":null,
  "decoder":null,
  "model":{
    "type":"BPE",
    "dropout":null,
    "unk_token":"[UNK]",
    "continuing_subword_prefix":null,
    "end_of_word_suffix":null,
    "fuse_unk":false,
    "vocab":{
      "[UNK]":0,
      "[CLS]":1,
      "[SEP]":2,
      "[PAD]":3,
      "[MASK]":4,
      "AA":5,
      "A":6,
      "C":7,
      "D":8,
.....

merges:

....
      "QD FLPDSITF",
      "QPHY AS",
      "LR SE",
      "A DRV"
    ] #11742
  } #11743
} #11744
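
One thing worth checking when vocab and merges are edited by hand is whether every merge still refers to tokens that exist in the vocab. As far as I know, a BPE merge whose parts (or merged result) are missing from the vocab can make the model fail to deserialize, and the failure surfaces as this generic ModelWrapper error. The sketch below uses only the standard library; find_bad_merges is a hypothetical helper, not part of the tokenizers library, and it assumes continuing_subword_prefix is null (as in the file above), so the merged token is simply left + right.

```python
import json

def find_bad_merges(tokenizer_json):
    """Return (merge, reason) pairs for merges inconsistent with the vocab.

    Hypothetical helper for illustration; not part of the tokenizers library.
    Assumes continuing_subword_prefix is null, so merged token == left + right.
    """
    model = tokenizer_json["model"]
    vocab = model["vocab"]
    bad = []
    for merge in model["merges"]:
        # In this file merges are "left right" strings; two-element lists
        # are also accepted here in case a file stores them that way.
        parts = merge.split(" ") if isinstance(merge, str) else list(merge)
        if len(parts) != 2:
            bad.append((merge, "not exactly two parts"))
            continue
        left, right = parts
        missing = [t for t in (left, right, left + right) if t not in vocab]
        if missing:
            bad.append((merge, "missing from vocab: " + ", ".join(missing)))
    return bad

# Tiny made-up example (not the real file): "A DRV" has no "DRV" or "ADRV" entry.
example = {
    "model": {
        "vocab": {"[UNK]": 0, "A": 1, "D": 2, "AD": 3},
        "merges": ["A D", "A DRV"],
    }
}
problems = find_bad_merges(example)
for merge, reason in problems:
    print(merge, "->", reason)
```

To run it on a real file, load it with json.load(open(path, encoding="utf-8")) and pass the result to find_bad_merges. An empty result does not prove the file is valid, it only rules out this one kind of inconsistency.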

@Chrode Would you please let me know how you solved this problem?

I had to retrain the tokenizer to make it work. When retraining, I had to initialize the tokenizer with a pre_tokenizer:

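from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace() # this is the line I was missing which caused the error
trainer = trainers.BpeTrainer(
    vocab_size=10_000,
    min_frequency=5
)
tokenizer.train(['./input.txt'], trainer)
tokenizer.save('./bpe_tokenizer.json')
# Reload the saved file both ways to confirm that it parses
tokenizer1 = Tokenizer.from_file('./bpe_tokenizer.json')
tokenizer2 = PreTrainedTokenizerFast(tokenizer_file='./bpe_tokenizer.json')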

@arnab9learns Unfortunately I have not, but @gundeep this works, thanks!