I am running a training task on a GCP TPU v2-8 instance. The following script utilizes the TPUs; however, it hangs for a long time during the checkpointing stage and never completes.
python ./transformers/examples/pytorch/language-modeling/run_clm.py --model_name_or_path "gpt2" \
--train_file data.txt \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--do_train \
--output_dir my-gpt \
--overwrite_output_dir \
--log_level debug \
--save_steps 100000 \
--num_train_epochs 40
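For context, the above is the plain single-process invocation. A sketch of the multi-core TPU launch through the xla_spawn.py wrapper that ships with the Transformers examples would look like the following (the relative paths and the 8-core count for v2-8 are my assumptions):

```shell
# Assumed layout: xla_spawn.py sits next to the example scripts in the
# transformers checkout; --num_cores 8 matches a v2-8 TPU.
python ./transformers/examples/pytorch/xla_spawn.py --num_cores 8 \
  ./transformers/examples/pytorch/language-modeling/run_clm.py \
  --model_name_or_path "gpt2" \
  --train_file data.txt \
  --per_device_train_batch_size 2 \
  --per_device_eval_batch_size 2 \
  --do_train \
  --output_dir my-gpt \
  --overwrite_output_dir \
  --log_level debug \
  --save_steps 100000 \
  --num_train_epochs 40
```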
Latest point at which the command hangs:
INFO:datasets.arrow_dataset:Loading cached processed dataset at /home/xxxx/research/transformers/cache/text/default-26a7b93901fc6ac6/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34/cache-fa9d556d4fa5e412.arrow
Loading cached processed dataset at /home/xxxx/research/transformers/cache/text/default-26a7b93901fc6ac6/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34/cache-287f2e388ba71c1a.arrow
INFO:datasets.arrow_dataset:Loading cached processed dataset at /home/xxxx/research/transformers/cache/text/default-26a7b93901fc6ac6/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34/cache-287f2e388ba71c1a.arrow
[2024-01-23 12:58:01,089] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-23 12:58:01,090] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-23 12:58:01,111] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-23 12:58:01,198] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
28%|████████████████████████▍ | 100/360 [02:32<04:10, 1.04it/s]