I wanted to add some tokens to a GPTNeoXTokenizerFast tokenizer, but since it has no train function I had to use the train_new_from_iterator method inherited from the PreTrainedTokenizerFast class.
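In isolation, the training call looks like this (a minimal sketch with a throwaway toy corpus, not my real data):

from transformers import GPTNeoXTokenizerFast

tokenizer = GPTNeoXTokenizerFast.from_pretrained("path", local_files_only=True)

# train_new_from_iterator is defined on PreTrainedTokenizerFast, so it is
# available here even though GPTNeoXTokenizerFast has no train() of its own.
toy_corpus = ["some example text", "more example text"]  # placeholder strings
new_tokenizer = tokenizer.train_new_from_iterator(toy_corpus, vocab_size=1258)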
I took the new vocab and merges, combined them with the original vocab and merges, and saved the result in a new tokenizer.json file. However, when I try to load the 'new' tokenizer I get the following error:
The GPTNeoXSdpaAttention class is deprecated in favor of simply modifying the config._attn_implementation attribute of the GPTNeoXAttention class! It will be removed in v4.48
Traceback (most recent call last):
  File "C:\Users\echagnon\PycharmProjects\mass_spec\forge_chem_test.py", line 4, in <module>
    tokenizer = GPTNeoXTokenizerFast.from_pretrained("combined_tokenizer/", local_files_only=True)
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\tokenization_utils_base.py", line 2036, in from_pretrained
    return cls._from_pretrained(
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\tokenization_utils_base.py", line 2276, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\models\gpt_neox\tokenization_gpt_neox_fast.py", line 106, in __init__
    super().__init__(
  File "C:\Users\echagnon\anaconda3\envs\mass_spec\lib\site-packages\transformers\tokenization_utils_fast.py", line 117, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum MergeType at line 105581 column 1
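For reference, the failing script is nothing more than the load itself (line 4 of forge_chem_test.py in the traceback above):

from transformers import GPTNeoXTokenizerFast

# This is the call that raises the MergeType exception.
tokenizer = GPTNeoXTokenizerFast.from_pretrained("combined_tokenizer/", local_files_only=True)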
Here is how I created the new tokenizer.json file which I am trying to read:
import copy
import json

import pandas as pd
from transformers import GPTNeoXTokenizerFast

tokenizer = GPTNeoXTokenizerFast.from_pretrained("path", local_files_only=True)
files = pd.read_csv('data/data.csv')['field'].to_list()

desired_tokens_to_add = 1000
ASCII_VOCAB_SIZE = 258
new_tokenizer = tokenizer.train_new_from_iterator(files, vocab_size=ASCII_VOCAB_SIZE + desired_tokens_to_add, max_token_length=5)
new_tokenizer.save_pretrained('new_path')

original_vocab = tokenizer.get_vocab()  # {'token': id}
new_vocab = new_tokenizer.get_vocab()   # {'token': id}

combined_vocab = {}
idx = 0  # so new tokens get new id values after the original ones
for token in original_vocab.keys():
    if token not in combined_vocab.keys():
        combined_vocab[token] = idx
        idx += 1
for token in new_vocab.keys():
    if token not in combined_vocab.keys():
        combined_vocab[token] = idx
        idx += 1

with open('path/tokenizer.json', encoding="utf8") as f:
    original_json = json.load(f)
with open('new_path/tokenizer.json', encoding="utf8") as f:
    new_json = json.load(f)

old_merges = original_json['model']['merges']  # [[]]
new_merges = new_json['model']['merges']       # [[]]
combined_merges = old_merges + new_merges

final_json = copy.deepcopy(original_json)
final_json['model']['merges'] = combined_merges
final_json['model']['vocab'] = combined_vocab

with open('final_path/tokenizer.json', 'w', encoding="utf8") as fp:
    print('saving new .json file')
    json.dump(final_json, fp, ensure_ascii=False)
Note: I copied the special_tokens_map.json and tokenizer_config.json from the directory containing the original tokenizer into the directory holding the new tokenizer.json.
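In case the exact format of the merges entries matters for the MergeType error, this is a quick check I can run on the three tokenizer.json files (a sketch using the same paths as in the code above):

import json

# Print the type and value of the first merge entry in each file, to see
# whether merges are stored as single strings ("a b") or as pairs (["a", "b"]).
for path in ('path/tokenizer.json', 'new_path/tokenizer.json', 'final_path/tokenizer.json'):
    with open(path, encoding="utf8") as f:
        merges = json.load(f)['model']['merges']
    print(path, type(merges[0]).__name__, merges[0])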