Unable to load a newly trained tokenizer from local files

I wanted to add some tokens to a GPTNeoXTokenizerFast tokenizer, but since there is no train function I had to use the train_new_from_iterator method from the inherited PreTrainedTokenizerFast class.

I took the new vocab and merges, added them to the original vocab and merges, and saved the result in a new tokenizer.json file. However, when I try to load the ‘new’ tokenizer I get the following error:

The GPTNeoXSdpaAttention class is deprecated in favor of simply modifying the config._attn_implementation attribute of the GPTNeoXAttention class! It will be removed in v4.48
Traceback (most recent call last):
  File "C:\Users\echagnon\PycharmProjects\mass_spec\forge_chem_test.py", line 4, in <module>
    tokenizer = GPTNeoXTokenizerFast.from_pretrained("combined_tokenizer/", local_files_only=True)
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\tokenization_utils_base.py", line 2036, in from_pretrained
    return cls._from_pretrained(
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\tokenization_utils_base.py", line 2276, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\models\gpt_neox\tokenization_gpt_neox_fast.py", line 106, in __init__
    super().__init__(
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\tokenization_utils_fast.py", line 117, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum MergeType at line 105581 column 1

Here is how I created the new tokenizer.json file which I am trying to read:

import copy
import json
import pandas as pd
from transformers import GPTNeoXTokenizerFast

tokenizer = GPTNeoXTokenizerFast.from_pretrained("path", local_files_only=True)
files = pd.read_csv('data/data.csv')['field'].to_list()
desired_tokens_to_add = 1000
ASCII_VOCAB_SIZE = 258

new_tokenizer = tokenizer.train_new_from_iterator(
    files, vocab_size=ASCII_VOCAB_SIZE + desired_tokens_to_add, max_token_length=5
)

new_tokenizer.save_pretrained('new_path')


original_vocab = tokenizer.get_vocab()  # {'token': id}
new_vocab = new_tokenizer.get_vocab()   # {'token': id}

combined_vocab = {}
idx = 0  # so new tokens have new id values
for token in original_vocab.keys():
    if token not in combined_vocab.keys():
        combined_vocab[token] = idx
        idx += 1
for token in new_vocab.keys():
    if token not in combined_vocab.keys():
        combined_vocab[token] = idx
        idx += 1



with open('path/tokenizer.json', encoding="utf8") as f:
    original_json = json.load(f)
with open('new_path/tokenizer.json', encoding="utf8") as f:
    new_json = json.load(f)
old_merges = original_json['model']['merges']  # [[]]
new_merges = new_json['model']['merges']       # [[]]
combined_merges = old_merges + new_merges

final_json = copy.deepcopy(original_json)
final_json['model']['merges'] = combined_merges
final_json['model']['vocab'] = combined_vocab

with open('final_path/tokenizer.json', 'w', encoding="utf8") as fp:
    print('saving new .json file')
    json.dump(final_json, fp, ensure_ascii=False)

Note: I copied the special_tokens_map.json and tokenizer_config.json from the directory containing the original tokenizer.
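That copy step looks roughly like this (a minimal sketch; 'path' and 'final_path' are just the placeholder directories used above):

import shutil

# Copy the auxiliary config files from the original tokenizer directory
# into the directory that holds the combined tokenizer.json.
for name in ('special_tokens_map.json', 'tokenizer_config.json'):
    shutil.copy(f'path/{name}', f'final_path/{name}')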


A slightly older version of transformers gives a similar error, but I wonder if that’s it…

Yes, I’ve seen those GitHub issues for similar errors, but they usually have to do with just downloading something from the Hub. I also have the most up-to-date version of the transformers library, so I’m still stumped, unfortunately.
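For reference, a quick way to double-check which versions are actually installed (tokenizers is the package that deserializes tokenizer.json and raises this error; transformers only wraps it):

import tokenizers
import transformers

# Print both versions, since the tokenizer.json parsing happens in tokenizers.
print('transformers', transformers.__version__)
print('tokenizers', tokenizers.__version__)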


I found the issue. The new tokenizer had a different format for the merges than the original.

The original merges were formatted: ["a b", "c d", … "y z"]
The newly trained merges were formatted: [["a", "b"], ["c", "d"], … ["y", "z"]]

Why this happens I have no idea. I would have thought that using the train_new_from_iterator function would keep everything in the same format. My guess is that GPTNeoXTokenizerFast and PreTrainedTokenizerFast use different formats for the merges, and the only way to train a GPTNeoXTokenizerFast tokenizer (at least from what I saw in the documentation) is to use train_new_from_iterator from the PreTrainedTokenizerFast class.
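A minimal sketch of the workaround, assuming the variables from the snippet above (old_merges, new_json): join the pair-style merges into space-separated strings so the combined list uses a single format.

new_merges = new_json['model']['merges']
# The newly trained merges are pairs like ["a", "b"]; convert them to the
# "a b" string form used by the original tokenizer.json before concatenating,
# so the file doesn't mix the two merge representations.
normalized_new_merges = [
    ' '.join(m) if isinstance(m, (list, tuple)) else m for m in new_merges
]
combined_merges = old_merges + normalized_new_merges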

