Hmm, if you want to know more about the technical details of fine-tuning, I think it would be quicker to ask on the Hugging Face Discord or Unsloth’s Discord…
Regarding the speed difference between Trainer and a native PyTorch training loop, the opposite case can also occur. If you want to make effective use of multiple GPUs with Trainer, I think you will need FSDP or DeepSpeed, so there may be some overhead there.
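If it helps, here is a minimal sketch of what that could look like (illustrative only: the values are placeholders, and the `fsdp`/`deepspeed` arguments depend on your transformers version):

```
from transformers import TrainingArguments

# Placeholder values for illustration; tune them for your model and hardware.
training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    fsdp="full_shard auto_wrap",    # shard the model across GPUs with PyTorch FSDP
    # deepspeed="ds_config.json",   # or point to a DeepSpeed config instead of FSDP
)
```

You would then launch the script with `torchrun` or `accelerate launch` so that each GPU runs its own process; that per-process setup and the inter-GPU communication are where the extra overhead tends to come from.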
Hi all,
I am working on a text classification task with a “distilbert-base-uncased” checkpoint and the “emotion” dataset. When I fine-tune the model, I average 0.34 s/it when using the HF Trainer, but when I use a native PyTorch training loop I get 29.16 s/it. What am I doing wrong? Below are the two snippets; the bulk of the code is taken from Fine-tune a pretrained model.
```
training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_d…
```
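For reference, the native PyTorch loop from that tutorial boils down to roughly the following (a sketch for the “emotion” dataset, not the poster’s actual second snippet). A natural first check is that the DataLoader batch size matches Trainer’s `per_device_train_batch_size` and that both the model and every batch are moved to the GPU; either difference alone could plausibly account for a gap like 0.34 s/it vs 29.16 s/it.

```
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_scheduler

# Sketch of the tutorial-style native loop; hyperparameters are placeholders.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = load_dataset("emotion", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], padding="max_length", truncation=True),
    batched=True,
)
dataset = dataset.remove_columns(["text"]).rename_column("label", "labels")
dataset.set_format("torch")

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # without this the loop silently trains on the CPU

train_dataloader = DataLoader(dataset, shuffle=True, batch_size=16)
optimizer = AdamW(model.parameters(), lr=2e-5)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=2 * len(train_dataloader),
)

model.train()
for epoch in range(2):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}  # move each batch to the GPU
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```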
(GitHub issue: opened 27 Oct 2021 UTC, closed 5 Nov 2021 UTC)
I am trying to train the Bert-base-uncased model on Nvidia 3080 GPUs. However, the strange thing is that the time spent on one step grows sharply with the number of GPUs, and the total time using multiple GPUs is similar to a single GPU. I directly ran the sample code provided at this [link](https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling) and the problem still occurs. BTW, I have run `transformers.trainer` using multiple GPUs on this machine, and the time per step increases only a little in distributed training.
The CUDA version shown by `nvidia-smi` is 11.4 and the environment is:
- `transformers` version: 4.11.3
- Platform: Linux-5.11.0-38-generic-x86_64-with-debian-bullseye-sid
- Python version: 3.7.6
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
The relevant outputs on two GPUs are:
```
FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
cuda:0
10/28/2021 20:21:55 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
Use FP16 precision: False
cuda:1
10/28/2021 20:21:55 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Use FP16 precision: False
..........................
10/28/2021 20:22:28 - INFO - __main__ - ***** Running training *****
10/28/2021 20:22:28 - INFO - __main__ - Num examples = 4627
10/28/2021 20:22:28 - INFO - __main__ - Num Epochs = 3
10/28/2021 20:22:28 - INFO - __main__ - Instantaneous batch size per device = 2
10/28/2021 20:22:28 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 32
10/28/2021 20:22:28 - INFO - __main__ - Gradient Accumulation steps = 8
10/28/2021 20:22:28 - INFO - __main__ - Total optimization steps = 435
0%|▏ | 1/435 [00:11<1:24:51, 11.73s/it]
10/28/2021 20:22:40 - INFO - root - Reducer buckets have been rebuilt in this iteration.
10/28/2021 20:22:40 - INFO - root - Reducer buckets have been rebuilt in this iteration.
32%|███████████████████████████████▌ | 140/435 [02:52<05:42, 1.16s/it]
```
The outputs on a single GPU are:
```
10/28/2021 20:26:47 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Use FP16 precision: False
.......................
10/28/2021 20:27:49 - INFO - __main__ - ***** Running training *****
10/28/2021 20:27:49 - INFO - __main__ - Num examples = 4627
10/28/2021 20:27:49 - INFO - __main__ - Num Epochs = 3
10/28/2021 20:27:49 - INFO - __main__ - Instantaneous batch size per device = 2
10/28/2021 20:27:49 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
10/28/2021 20:27:49 - INFO - __main__ - Gradient Accumulation steps = 8
10/28/2021 20:27:49 - INFO - __main__ - Total optimization steps = 870
4%|███▉ | 35/870 [00:17<06:34, 2.12it/s]
```
The highlighted points are that the time per step sharply increases with distributed training, while the total time is similar in the two settings.
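For reference, the step counts in the two logs line up exactly with the effective batch sizes; a quick arithmetic check (not a diagnosis of the slowdown):

```
num_examples = 4627
per_device_batch = 2
grad_accum = 8
epochs = 3

for num_gpus in (1, 2):
    total_batch = num_gpus * per_device_batch * grad_accum   # 16 and 32, as in the logs
    steps_per_epoch = -(-num_examples // total_batch)        # ceil division
    print(num_gpus, total_batch, epochs * steps_per_epoch)   # -> 870 and 435 total steps
```

So with half as many optimization steps on two GPUs, a similar total training time does imply that each distributed step takes roughly twice as long, which is what the ~1.16 s/it vs ~2.12 it/s progress bars show.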