Customized tokenization files in run_clm script

Hi there, I have the same problem as the "Load custom pretrained tokenizer" topic. I have used the Tokenizers library to train a tokenizer and save it as a JSON file, and I would like to use it with the script. The command I ran is:

python -m torch.distributed.launch --nproc_per_node 1 run_clm-4.8.0.py \
  --model_type gpt2 \
  --train_file ./dataset/train_dataset.txt \
  --use_fast_tokenizer true \
  --tokenizer_name ./tokenization/my_tokenizer.json
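
For reference, the tokenizer JSON was produced with the Tokenizers library along these lines (a rough sketch; the exact model and trainer settings shouldn't matter for the error):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Train a simple BPE tokenizer on the training text and save it as a single JSON file
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train(files=["./dataset/train_dataset.txt"], trainer=trainer)
tokenizer.save("./tokenization/my_tokenizer.json")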

However, I got the following errors:

 File "run_clm-4.8.0.py", line 492, in <module>
    main()
  File "run_clm-4.8.0.py", line 308, in main
    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 498, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 359, in get_tokenizer_config
    resolved_config_file = get_file_from_repo(
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/utils/hub.py", line 678, in get_file_from_repo
    resolved_file = cached_path(
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/utils/hub.py", line 282, in cached_path
    output_path = get_from_cache(
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/utils/hub.py", line 545, in get_from_cache
    raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 163675) of binary: /home/user/miniconda3/envs/gpt/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm-4.8.0.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-10_14:56:08
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 163675)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Does anyone have ideas about how to point the script at a customized tokenizer file? Thanks!

In my case I found I had to modify the run_clm.py script to use PreTrainedTokenizerFast rather than AutoTokenizer:

tokenizer = PreTrainedTokenizerFast(tokenizer_file=model_args.tokenizer_name, model_max_length=256, mask_token="<mask>", pad_token="<pad>")
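
Roughly, that means replacing the AutoTokenizer.from_pretrained() call in the tokenizer-loading branch of run_clm.py with something like the following (just a sketch; the surrounding code and the .json check depend on your Transformers version and on how you want to detect a raw tokenizer file):

from transformers import AutoTokenizer, PreTrainedTokenizerFast

if model_args.tokenizer_name:
    if model_args.tokenizer_name.endswith(".json"):
        # Wrap the raw Tokenizers JSON file directly in a fast tokenizer
        tokenizer = PreTrainedTokenizerFast(
            tokenizer_file=model_args.tokenizer_name,
            model_max_length=256,
            mask_token="<mask>",
            pad_token="<pad>",
        )
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)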

After recently diving into this problem, I finally figured out how to deal with “using the Tokenizers module to build a tokenizer and loading it through the Transformers AutoTokenizer.from_pretrained() API”. I’ll write down the details for those who might encounter this problem in the future.

@jbmaxwell’s reply is one method, but it requires customizing the script and is not the best option, because it can lose some important tokenization files such as merges.txt, special_tokens_map.json, tokenizer_config.json, tokenizer.json and vocab.json. The better way to deal with this problem is:

# Tokenizers API
tokenizer.save('./path/to/your/tokenization.json')

# Transformers API
from transformers import PreTrainedTokenizerFast
new_tokenizer = PreTrainedTokenizerFast(tokenizer_file='./path/to/your/tokenization.json')
new_tokenizer.add_special_tokens(
    {'bos_token': '[BOS]',
     'eos_token': '[EOS]',
     'unk_token': '[UNK]',
     'sep_token': '[SEP]',
     'pad_token': '[PAD]',
     'cls_token': '[CLS]',
     'mask_token': '[MASK]'}
)
new_tokenizer.save_pretrained('./path/to/your/folder/')  # then you have everything!
# >> ('./path/to/your/folder/tokenizer_config.json',
#     './path/to/your/folder/special_tokens_map.json',
#     './path/to/your/folder/tokenizer.json',
#     ...)
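
After that, you can reload the saved folder with AutoTokenizer and point run_clm at it directly, so no script changes should be needed; for example (a quick sanity-check sketch):

from transformers import AutoTokenizer

# Reload the re-saved folder the same way run_clm does
tok = AutoTokenizer.from_pretrained('./path/to/your/folder/', use_fast=True)
print(tok("hello world"))

# and pass the folder (not the raw .json) on the command line:
# python -m torch.distributed.launch --nproc_per_node 1 run_clm-4.8.0.py \
#   --model_type gpt2 \
#   --train_file ./dataset/train_dataset.txt \
#   --use_fast_tokenizer true \
#   --tokenizer_name ./path/to/your/folder/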

Oops! Yes, you’re absolutely right!

I encountered the same problem when I first used PreTrainedTokenizerFast, but completely forgot to mention the (essential) detail of adding the special tokens and re-saving. Once you’ve re-saved, you no longer need to add the tokens (which is probably why that step completely slipped my mind!).

Apologies for the partial answer!
