Customized tokenization files in run_clm script

Hi there, I have the same problem as the "Load custom pretrained tokenizer" topic. I have used the Tokenizers library to train a tokenizer and save it as a JSON file, and I would like to use it with the script. The command I ran is:

python -m torch.distributed.launch --nproc_per_node 1 run_clm-4.8.0.py \
  --model_type gpt2 \
  --train_file ./dataset/train_dataset.txt \
  --use_fast_tokenizer true \
  --tokenizer_name ./tokenization/my_tokenizer.json
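
For reference, the tokenizer JSON was produced with the Tokenizers library along these lines (a rough sketch; the exact model and trainer settings shouldn't matter for the error):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Train a simple BPE tokenizer on the training text and save it as a single JSON file
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train(files=["./dataset/train_dataset.txt"], trainer=trainer)
tokenizer.save("./tokenization/my_tokenizer.json")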

However, I got the following errors:

 File "run_clm-4.8.0.py", line 492, in <module>
    main()
  File "run_clm-4.8.0.py", line 308, in main
    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 498, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 359, in get_tokenizer_config
    resolved_config_file = get_file_from_repo(
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/utils/hub.py", line 678, in get_file_from_repo
    resolved_file = cached_path(
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/utils/hub.py", line 282, in cached_path
    output_path = get_from_cache(
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/utils/hub.py", line 545, in get_from_cache
    raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 163675) of binary: /home/user/miniconda3/envs/gpt/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm-4.8.0.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-10_14:56:08
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 163675)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Does anyone have ideas about how to point the script at a customized tokenizer file? Thanks!

In my case I found I had to modify the run_clm.py script to use PreTrainedTokenizerFast rather than AutoTokenizer:

tokenizer = PreTrainedTokenizerFast(tokenizer_file=model_args.tokenizer_name, model_max_length=256, mask_token="<mask>", pad_token="<pad>")
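
Roughly, that means replacing the AutoTokenizer.from_pretrained() call in the tokenizer-loading branch of run_clm.py with something like the following (just a sketch; the surrounding code and the .json check depend on your Transformers version and on how you want to detect a raw tokenizer file):

from transformers import AutoTokenizer, PreTrainedTokenizerFast

if model_args.tokenizer_name:
    if model_args.tokenizer_name.endswith(".json"):
        # Wrap the raw Tokenizers JSON file directly in a fast tokenizer
        tokenizer = PreTrainedTokenizerFast(
            tokenizer_file=model_args.tokenizer_name,
            model_max_length=256,
            mask_token="<mask>",
            pad_token="<pad>",
        )
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)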

After recently diving into this problem, I finally figured out how to deal with “using the Tokenizers module to build a tokenizer and loading it through the Transformers AutoTokenizer.from_pretrained() API”. I’ll write down the details for those who might encounter this problem in the future.

@jbmaxwell’s reply is one method, but it requires customizing the script and is not the best option, because it can lose some important tokenization files such as merges.txt, special_tokens_map.json, tokenizer_config.json, tokenizer.json and vocab.json. The better way to deal with this problem is:

# Tokenizers API
tokenizer.save('./path/to/your/tokenization.json')

# Transformers API
from transformers import PreTrainedTokenizerFast
new_tokenizer = PreTrainedTokenizerFast(tokenizer_file='./path/to/your/tokenization.json')
new_tokenizer.add_special_tokens(
    {'bos_token': '[BOS]',
     'eos_token': '[EOS]',
     'unk_token': '[UNK]',
     'sep_token': '[SEP]',
     'pad_token': '[PAD]',
     'cls_token': '[CLS]',
     'mask_token': '[MASK]'}
)
new_tokenizer.save_pretrained('./path/to/your/folder/')  # then you have everything!
# >> ('./path/to/your/folder/tokenizer_config.json',
#     './path/to/your/folder/special_tokens_map.json',
#     './path/to/your/folder/tokenizer.json',
#     ...)
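
After that, you can reload the saved folder with AutoTokenizer and point run_clm at it directly, so no script changes should be needed; for example (a quick sanity-check sketch):

from transformers import AutoTokenizer

# Reload the re-saved folder the same way run_clm does
tok = AutoTokenizer.from_pretrained('./path/to/your/folder/', use_fast=True)
print(tok("hello world"))

# and pass the folder (not the raw .json) on the command line:
# python -m torch.distributed.launch --nproc_per_node 1 run_clm-4.8.0.py \
#   --model_type gpt2 \
#   --train_file ./dataset/train_dataset.txt \
#   --use_fast_tokenizer true \
#   --tokenizer_name ./path/to/your/folder/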

Oops! Yes, you’re absolutely right!

I encountered the same problem when I first used PreTrainedTokenizerFast, but completely forgot to mention the (essential) detail of adding the special tokens and re-saving. Once you’ve re-saved, you no longer need to add the tokens (which is probably why that step completely slipped my mind!).

Apologies for the partial answer!
