Hmm, if you want to know more about the technical details of fine-tuning, I think it would be quicker to ask on the Hugging Face Discord or Unsloth’s Discord…
Regarding the speed difference between Trainer and a native PyTorch training loop, the opposite case can also occur. If you want to make effective use of multiple GPUs with Trainer, I think you will need FSDP or DeepSpeed, so there may be some overhead there.
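If it helps, here is a minimal sketch of what that could look like (illustrative only: the values are placeholders, and the `fsdp`/`deepspeed` arguments depend on your transformers version):

```
from transformers import TrainingArguments

# Placeholder values for illustration; tune them for your model and hardware.
training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    fsdp="full_shard auto_wrap",    # shard the model across GPUs with PyTorch FSDP
    # deepspeed="ds_config.json",   # or point to a DeepSpeed config instead of FSDP
)
```

You would then launch the script with `torchrun` or `accelerate launch` so that each GPU runs its own process; that per-process setup and the inter-GPU communication are where the extra overhead tends to come from.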
Hi all,
I am working on a text classification task with a “distilbert-base-uncased” checkpoint and the “emotion” dataset. When I fine-tune the model, I average 0.34 s/it when using the HF Trainer, but when I use a native PyTorch training loop I get 29.16 s/it. What am I doing wrong? Below are the two snippets; the bulk of the code is taken from Fine-tune a pretrained model.
```
training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_d…
```
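For reference, the native PyTorch loop from that tutorial boils down to roughly the following (a sketch for the “emotion” dataset, not the poster’s actual second snippet). A natural first check is that the DataLoader batch size matches Trainer’s `per_device_train_batch_size` and that both the model and every batch are moved to the GPU; either difference alone could plausibly account for a gap like 0.34 s/it vs 29.16 s/it.

```
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_scheduler

# Sketch of the tutorial-style native loop; hyperparameters are placeholders.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = load_dataset("emotion", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], padding="max_length", truncation=True),
    batched=True,
)
dataset = dataset.remove_columns(["text"]).rename_column("label", "labels")
dataset.set_format("torch")

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # without this the loop silently trains on the CPU

train_dataloader = DataLoader(dataset, shuffle=True, batch_size=16)
optimizer = AdamW(model.parameters(), lr=2e-5)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=2 * len(train_dataloader),
)

model.train()
for epoch in range(2):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}  # move each batch to the GPU
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```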
(GitHub issue: opened 27 Oct 2021 UTC, closed 5 Nov 2021 UTC)
I am trying to train the Bert-base-uncased model on Nvidia 3080 GPUs. However, the strange thing is that the time spent on one step grows sharply with the number of GPUs, and the total time using multiple GPUs is similar to a single GPU. I directly ran the sample code provided at this [link](https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling) and the problem still occurs. BTW, I have run `transformers.trainer` using multiple GPUs on this machine, and the time per step increases only a little in distributed training.
The CUDA version shown by `nvidia-smi` is 11.4 and the environment is:
- `transformers` version: 4.11.3
- Platform: Linux-5.11.0-38-generic-x86_64-with-debian-bullseye-sid
- Python version: 3.7.6
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
The relevant outputs on two GPUs are:
```
FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
cuda:0
10/28/2021 20:21:55 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
Use FP16 precision: False
cuda:1
10/28/2021 20:21:55 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Use FP16 precision: False
..........................
10/28/2021 20:22:28 - INFO - __main__ - ***** Running training *****
10/28/2021 20:22:28 - INFO - __main__ - Num examples = 4627
10/28/2021 20:22:28 - INFO - __main__ - Num Epochs = 3
10/28/2021 20:22:28 - INFO - __main__ - Instantaneous batch size per device = 2
10/28/2021 20:22:28 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 32
10/28/2021 20:22:28 - INFO - __main__ - Gradient Accumulation steps = 8
10/28/2021 20:22:28 - INFO - __main__ - Total optimization steps = 435
0%|▏ | 1/435 [00:11<1:24:51, 11.73s/it]
10/28/2021 20:22:40 - INFO - root - Reducer buckets have been rebuilt in this iteration.
10/28/2021 20:22:40 - INFO - root - Reducer buckets have been rebuilt in this iteration.
32%|███████████████████████████████▌ | 140/435 [02:52<05:42, 1.16s/it]
```
The outputs on a single GPU are:
```
10/28/2021 20:26:47 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Use FP16 precision: False
.......................
10/28/2021 20:27:49 - INFO - __main__ - ***** Running training *****
10/28/2021 20:27:49 - INFO - __main__ - Num examples = 4627
10/28/2021 20:27:49 - INFO - __main__ - Num Epochs = 3
10/28/2021 20:27:49 - INFO - __main__ - Instantaneous batch size per device = 2
10/28/2021 20:27:49 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
10/28/2021 20:27:49 - INFO - __main__ - Gradient Accumulation steps = 8
10/28/2021 20:27:49 - INFO - __main__ - Total optimization steps = 870
4%|███▉ | 35/870 [00:17<06:34, 2.12it/s]
```
The highlighted points are that the time per step sharply increases with distributed training, while the total time is similar in the two settings.
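For reference, the step counts in the two logs line up exactly with the effective batch sizes; a quick arithmetic check (not a diagnosis of the slowdown):

```
num_examples = 4627
per_device_batch = 2
grad_accum = 8
epochs = 3

for num_gpus in (1, 2):
    total_batch = num_gpus * per_device_batch * grad_accum   # 16 and 32, as in the logs
    steps_per_epoch = -(-num_examples // total_batch)        # ceil division
    print(num_gpus, total_batch, epochs * steps_per_epoch)   # -> 870 and 435 total steps
```

So with half as many optimization steps on two GPUs, a similar total training time does imply that each distributed step takes roughly twice as long, which is what the ~1.16 s/it vs ~2.12 it/s progress bars show.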