What is Transformers doing? Why is it so slow?

I am training BERT on a dataset of over 2 T. Tokenization is complete, and training has reached step 500 of 100,000 (the outer progress bar below). However, once it hits the interval set with --eval_steps and --logging_steps, the run stops using the GPU and switches to what appears to be CPU-bound work that looks like it will take a very long time to finish.
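For reference, a minimal sketch of the kind of Trainer setup described above; the actual script differs, and the model size, paths, batch sizes, and the 500-step interval are assumptions inferred from the flags and the log:

from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder setup -- model size, paths, and batch sizes are assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

training_args = TrainingArguments(
    output_dir="./bert-pretraining",   # assumption
    max_steps=100_000,                 # matches the outer bar (500/100000)
    evaluation_strategy="steps",       # "eval_strategy" in newer transformers versions
    eval_steps=500,                    # assumption: inferred from the log, eval starts at step 500
    logging_steps=500,
    per_device_train_batch_size=32,    # assumption
    per_device_eval_batch_size=32,     # assumption
)

# The Trainer is then built with the pre-tokenized datasets; the inner
# progress bar below (0/13,564,759) presumably iterates over eval_dataset.
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized_train,   # pre-tokenized training split (hypothetical name)
#     eval_dataset=tokenized_eval,     # very large eval split (hypothetical name)
#     data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
# )
# trainer.train()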
The relevant part of the log is below; I would appreciate any help explaining what this stage is doing and any suggestions for speeding it up.


  0%|          | 499/100000 [152:02:34<24501:24:43, 886.47s/it]
  0%|          | 500/100000 [152:17:13<24442:08:51, 884.34s/it]

  0%|          | 0/13564759 [00:00<?, ?it/s]
  0%|          | 2/13564759 [00:05<10294:25:46,  2.73s/it]
  0%|          | 3/13564759 [00:12<16387:56:31,  4.35s/it]
  0%|          | 4/13564759 [00:17<17900:02:18,  4.75s/it]
  0%|          | 5/13564759 [00:23<19099:21:20,  5.07s/it]