FP16 doesn't reduce Trainer Training time

OlivierCR · July 20, 2021, 1:12pm

Hi,
I’m using this SageMaker HF sample notebooks/sagemaker-notebook.ipynb at master · huggingface/notebooks · GitHub adjusted with train_batch_size = 128, tested on both 1 p3.16xlarge and 1 p4d.24xlarge. For each instance I’m running one job with fp16=True and one job without the flag. GPU usage is erratic (a sawtooth oscillating between 50 and 750%).

The impact of fp16=True is only a 1% training time reduction, on each instance. Is it because:

Not specifying fp16 in the trainer already uses fp16? (seems to be false by default though)
There is a lot of CPU work & I/O in that demo that will not leverage float16?
Transformers don’t benefit from fp16 training?

philschmid · July 20, 2021, 1:29pm

Hey @OlivierCR,

Can you please share all the hyperparameters you used? train_batch_size = 128 seems too high to work.
Are you using distributed training?
The training time reduction should be much higher, around 20-40%.

When using the example, have you adjusted train.py to accept fp16 as a hyperparameter, or have you defined it directly in the script?

OlivierCR · July 21, 2021, 7:46am

The batch size was 32 on 1 V100, so it’s OK to scale 4x on 4*V100, right? My understanding is that train_batch_size is the cluster-level batch, not the per-GPU batch, right? Here is my config. It ran fine on both p3.16x and p4d.24xl, and GPU mem is not 800%.

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 3,
                 'train_batch_size': 128,
                 'model_name':'distilbert-base-uncased',
                 'fp16': True}

huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type='ml.p4d.24xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.6',
                            pytorch_version='1.7',
                            py_version='py36',
                            hyperparameters = hyperparameters)

train_batch_size = 256 also works fine: it is as fast as 128 on p3.16x and 7% faster than 128 on p4d.24xlarge (both still with fp16).

philschmid · July 21, 2021, 11:21am

No, train_batch_size is the batch size per device. See notebooks/train.py at 6b8367544a374daf187854c6ba2274640fa0d7b4 · huggingface/notebooks · GitHub or the documentation here: Trainer — transformers 4.7.0 documentation
That means train_batch_size = 256 would have resulted in 256*8 = 2048, which is not possible.

Have you modified train.py? Otherwise fp16: True will not be used, since it is not parsed here: notebooks/train.py at 6b8367544a374daf187854c6ba2274640fa0d7b4 · huggingface/notebooks · GitHub
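
The per-device arithmetic above can be sketched in a few lines. This is only an illustration: the helper name `effective_batch_size` is hypothetical, and the figure of 8 GPUs assumes a single p3.16xlarge or p4d.24xlarge instance.

```python
def effective_batch_size(per_device_batch_size, num_gpus, grad_accum_steps=1):
    """Samples consumed per optimizer step when the batch size is per device."""
    return per_device_batch_size * num_gpus * grad_accum_steps


# Assumption: one instance with 8 GPUs (p3.16xlarge or p4d.24xlarge).
print(effective_batch_size(128, 8))
print(effective_batch_size(256, 8))
```

So a value that looks like a cluster-level batch of 256 would, if it were actually applied per device, mean 2048 samples per step.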

OlivierCR · July 21, 2021, 1:21pm

Well, running train_batch_size = 256 on 1*p3.16xlarge worked for me.

philschmid · July 22, 2021, 8:50am

Could you share the exact code/notebook you have used? I would like to reproduce it.

OlivierCR · July 27, 2021, 1:58pm

philschmid · July 28, 2021, 6:47am

So you didn’t change anything?
If so, fp16 couldn’t work because it is not parsed in train.py. Can you share the logs of your script? There you should be able to see which batch size was used by the Trainer.

OlivierCR · July 28, 2021, 8:30am

@philschmid I didn’t change anything apart from the config above in the estimator (batch size, fp16, instance type). Did you try and see that batch 256 works fine on p3.16x?

philschmid · July 28, 2021, 11:26am

Hey,

Found the issue. train.py expected the hyperparameter to be train-batch-size, but we always passed in train_batch_size, so it used the default of 32 and not the value passed in. I fixed all of the examples, tested 256, and it failed.

If you want to test fp16, you would need to change train.py.
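
A minimal sketch of how such a name mismatch can go unnoticed, assuming the script parses its arguments with argparse and `parse_known_args` (which silently sets unrecognized flags aside instead of raising). The function name `parse_hyperparameters` is hypothetical and this is not the actual train.py.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse a hyphenated flag the way the script in this thread is described."""
    parser = argparse.ArgumentParser()
    # The script expects the hyphenated spelling.
    parser.add_argument("--train-batch-size", type=int, default=32)
    # parse_known_args does not fail on flags it does not recognize.
    args, unknown = parser.parse_known_args(argv)
    return args.train_batch_size, unknown


# Underscore spelling: not recognized, so the default of 32 is kept.
print(parse_hyperparameters(["--train_batch_size", "128"]))
# Hyphenated spelling: the value is picked up.
print(parse_hyperparameters(["--train-batch-size", "128"]))
```

Checking the list of unknown arguments, or the batch size the Trainer logs at startup, is one way to catch this kind of silent fallback.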

sd3ntato · June 29, 2023, 9:13am

Hi,

So how would I change the code to enable fp16 computation in this case?
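
One possible way to do this, as a sketch rather than the official fix: add an `fp16` argument to the parser in train.py and forward it to `TrainingArguments`. The assumptions here are that the hyperparameter reaches the script as a command-line string such as `--fp16 True`, and that the flag names match the keys in the estimator's `hyperparameters` dict. The helpers `str_to_bool` and `build_training_kwargs` are hypothetical names.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False' style strings to bool (type=bool would treat 'False' as True)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_training_kwargs(argv):
    """Parse hyperparameters and map them to TrainingArguments keyword names."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--fp16", type=str_to_bool, default=False)
    args, _ = parser.parse_known_args(argv)
    return {
        "num_train_epochs": args.epochs,
        "per_device_train_batch_size": args.train_batch_size,
        "fp16": args.fp16,
    }


kwargs = build_training_kwargs(["--train_batch_size", "32", "--fp16", "True"])
print(kwargs)
# In train.py these would then be forwarded, for example:
#   from transformers import TrainingArguments
#   training_args = TrainingArguments(output_dir="/opt/ml/model", **kwargs)
# (fp16=True is expected to need a CUDA device, so that call is left as a comment.)
```

The training logs should then show whether mixed precision was actually enabled for the run.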
