What is Transformers doing? Why is it so slow?

I am training BERT on a dataset of over 2 T. Tokenization is complete, and training has reached step 500 of 100,000 (the outer progress bar below). However, once it hits the interval set with --eval_steps and --logging_steps, the run stops using the GPU and switches to what appears to be CPU-bound work that looks like it will take a very long time to finish.
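For reference, a minimal sketch of the kind of Trainer setup described above; the actual script differs, and the model size, paths, batch sizes, and the 500-step interval are assumptions inferred from the flags and the log:

from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder setup -- model size, paths, and batch sizes are assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

training_args = TrainingArguments(
    output_dir="./bert-pretraining",   # assumption
    max_steps=100_000,                 # matches the outer bar (500/100000)
    evaluation_strategy="steps",       # "eval_strategy" in newer transformers versions
    eval_steps=500,                    # assumption: inferred from the log, eval starts at step 500
    logging_steps=500,
    per_device_train_batch_size=32,    # assumption
    per_device_eval_batch_size=32,     # assumption
)

# The Trainer is then built with the pre-tokenized datasets; the inner
# progress bar below (0/13,564,759) presumably iterates over eval_dataset.
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized_train,   # pre-tokenized training split (hypothetical name)
#     eval_dataset=tokenized_eval,     # very large eval split (hypothetical name)
#     data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
# )
# trainer.train()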
The relevant part of the log is below; I would appreciate any help explaining what this stage is doing and any suggestions for speeding it up.


  0%|          | 499/100000 [152:02:34<24501:24:43, 886.47s/it]
  0%|          | 500/100000 [152:17:13<24442:08:51, 884.34s/it]

  0%|          | 0/13564759 [00:00<?, ?it/s]
  0%|          | 2/13564759 [00:05<10294:25:46,  2.73s/it]
  0%|          | 3/13564759 [00:12<16387:56:31,  4.35s/it]
  0%|          | 4/13564759 [00:17<17900:02:18,  4.75s/it]
  0%|          | 5/13564759 [00:23<19099:21:20,  5.07s/it]