Hi there, I have the same problem as this one: Load custom pretrained tokenizer. I used the Tokenizers library
to train a tokenizer and save it as a JSON file, and I would like to use it in the script. The command I run is:
python -m torch.distributed.launch --nproc_per_node 1 run_clm-4.8.0.py \
--model_type gpt2 \
--train_file ./dataset/train_dataset.txt \
--use_fast_tokenizer true \
--tokenizer_name ./tokenization/my_tokenizer.json
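For context, the JSON file was produced with the Tokenizers library, roughly along the lines of the sketch below (the BPE model, pre-tokenizer, and special tokens shown here are only placeholders, not necessarily exactly what I trained with):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Train a byte-level BPE tokenizer on the same text file used for training
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = ByteLevel()
trainer = BpeTrainer(special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train(files=["./dataset/train_dataset.txt"], trainer=trainer)

# Save everything into a single JSON file
tokenizer.save("./tokenization/my_tokenizer.json")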
However, I got the following error:
File "run_clm-4.8.0.py", line 492, in <module>
main()
File "run_clm-4.8.0.py", line 308, in main
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 498, in from_pretrained
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 359, in get_tokenizer_config
resolved_config_file = get_file_from_repo(
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/utils/hub.py", line 678, in get_file_from_repo
resolved_file = cached_path(
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/utils/hub.py", line 282, in cached_path
output_path = get_from_cache(
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/transformers/utils/hub.py", line 545, in get_from_cache
raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 163675) of binary: /home/user/miniconda3/envs/gpt/bin/python
Traceback (most recent call last):
File "/home/user/miniconda3/envs/gpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/miniconda3/envs/gpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm-4.8.0.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-10_14:56:08
host : localhost
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 163675)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
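The only workaround I can think of so far is to wrap the JSON file with PreTrainedTokenizerFast and save it as a regular tokenizer directory first, something like the sketch below (paths are my local ones, and I have not confirmed this is the intended way):

from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizers JSON file in a fast tokenizer object
tokenizer = PreTrainedTokenizerFast(tokenizer_file="./tokenization/my_tokenizer.json")

# Save it as a directory that AutoTokenizer.from_pretrained() can load,
# then point --tokenizer_name at that directory instead of the .json file
tokenizer.save_pretrained("./tokenization/my_tokenizer")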
Does anyone have ideas about how to point the script at a custom tokenizer file? Thanks!