Unable to load a newly trained tokenizer from local files

I wanted to add some tokens to a GPTNeoXTokenizerFast tokenizer, but since there is no train function I had to use the train_new_from_iterator method from the inherited PreTrainedTokenizerFast class.

I took the new vocab and merges, added them to the original vocab and merges, and saved the result in a new tokenizer.json file. However, when I try to load the ‘new’ tokenizer I get the following error:

The GPTNeoXSdpaAttention class is deprecated in favor of simply modifying the config._attn_implementation attribute of the GPTNeoXAttention class! It will be removed in v4.48
Traceback (most recent call last):
  File "C:\Users\echagnon\PycharmProjects\mass_spec\forge_chem_test.py", line 4, in <module>
    tokenizer = GPTNeoXTokenizerFast.from_pretrained("combined_tokenizer/", local_files_only=True)
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\tokenization_utils_base.py", line 2036, in from_pretrained
    return cls._from_pretrained(
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\tokenization_utils_base.py", line 2276, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\models\gpt_neox\tokenization_gpt_neox_fast.py", line 106, in __init__
    super().__init__(
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\tokenization_utils_fast.py", line 117, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum MergeType at line 105581 column 1

Here is how I created the new tokenizer.json file which I am trying to read:

import copy
import json
import pandas as pd
from transformers import GPTNeoXTokenizerFast

tokenizer = GPTNeoXTokenizerFast.from_pretrained("path", local_files_only=True)
files = pd.read_csv('data/data.csv')['field'].to_list()
desired_tokens_to_add = 1000
ASCII_VOCAB_SIZE = 258

new_tokenizer = tokenizer.train_new_from_iterator(
    files, vocab_size=ASCII_VOCAB_SIZE + desired_tokens_to_add, max_token_length=5
)

new_tokenizer.save_pretrained('new_path')


original_vocab = tokenizer.get_vocab()  # {'token': id}
new_vocab = new_tokenizer.get_vocab()   # {'token': id}

combined_vocab = {}
idx = 0  # so new tokens have new id values
for token in original_vocab.keys():
    if token not in combined_vocab.keys():
        combined_vocab[token] = idx
        idx += 1
for token in new_vocab.keys():
    if token not in combined_vocab.keys():
        combined_vocab[token] = idx
        idx += 1



with open('path/tokenizer.json', encoding="utf8") as f:
    original_json = json.load(f)
with open('new_path/tokenizer.json', encoding="utf8") as f:
    new_json = json.load(f)
old_merges = original_json['model']['merges']  # [[]]
new_merges = new_json['model']['merges']       # [[]]
combined_merges = old_merges + new_merges

final_json = copy.deepcopy(original_json)
final_json['model']['merges'] = combined_merges
final_json['model']['vocab'] = combined_vocab

with open('final_path/tokenizer.json', 'w', encoding="utf8") as fp:
    print('saving new .json file')
    json.dump(final_json, fp, ensure_ascii=False)

Note: I copied the special_tokens_map.json and tokenizer_config.json from the directory containing the original tokenizer.
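That copy step looks roughly like this (a minimal sketch; 'path' and 'final_path' are just the placeholder directories used above):

import shutil

# Copy the auxiliary config files from the original tokenizer directory
# into the directory that holds the combined tokenizer.json.
for name in ('special_tokens_map.json', 'tokenizer_config.json'):
    shutil.copy(f'path/{name}', f'final_path/{name}')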


A slightly older version of transformers gives a similar error, but I wonder if that’s it…

Yes, I’ve seen those GitHub issues for similar errors, but they usually have to do with just downloading something from the Hub. I also have the most up-to-date version of the transformers library, so I’m still stumped, unfortunately.
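For reference, a quick way to double-check which versions are actually installed (tokenizers is the package that deserializes tokenizer.json and raises this error; transformers only wraps it):

import tokenizers
import transformers

# Print both versions, since the tokenizer.json parsing happens in tokenizers.
print('transformers', transformers.__version__)
print('tokenizers', tokenizers.__version__)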


I found the issue. The new tokenizer had a different format for the merges than the original.

The original merges were formatted: ["a b", "c d", … "y z"]
The newly trained merges were formatted: [["a", "b"], ["c", "d"], … ["y", "z"]]

Why this happens I have no idea. I would have thought that using the train_new_from_iterator function would keep everything in the same format. My guess is that GPTNeoXTokenizerFast and PreTrainedTokenizerFast use different formats for the merges, and the only way to train a GPTNeoXTokenizerFast tokenizer (at least from what I saw in the documentation) is to use train_new_from_iterator from the PreTrainedTokenizerFast class.
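A minimal sketch of the workaround, assuming the variables from the snippet above (old_merges, new_json): join the pair-style merges into space-separated strings so the combined list uses a single format.

new_merges = new_json['model']['merges']
# The newly trained merges are pairs like ["a", "b"]; convert them to the
# "a b" string form used by the original tokenizer.json before concatenating,
# so the file doesn't mix the two merge representations.
normalized_new_merges = [
    ' '.join(m) if isinstance(m, (list, tuple)) else m for m in new_merges
]
combined_merges = old_merges + normalized_new_merges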

