FP16 doesn't reduce Trainer Training time

Hi,
I’m using this SageMaker HF sample, notebooks/sagemaker-notebook.ipynb at master · huggingface/notebooks · GitHub, adjusted with train_batch_size = 128 and tested on both 1 p3.16xlarge and 1 p4d.24xlarge. For each instance I’m running one job with fp16=True and one job without the flag. GPU usage is erratic (a sawtooth oscillating between 50 and 750%).

The impact of fp16=True is only a 1% training time reduction, on each instance. Is it because:

  1. Not specifying fp16 in the trainer already uses fp16? (seems to be false by default though)
  2. There is a lot of CPU work & I/O in that demo that will not leverage float16?
  3. Transformers don’t benefit from fp16 training?

Hey @OlivierCR,

Could you please share all of the hyperparameters you used? train_batch_size = 128 seems too high to work.
Are you using distributed training?
The training time reduction should be much higher, around 20-40%.

When using the example, have you adjusted train.py to accept fp16 as a hyperparameter, or have you defined it directly in the script?

The batch size was 32 on 1 V100, so it’s OK to scale 4x on 4*V100, right? My understanding is that train_batch_size is the cluster-level batch, not the per-GPU batch, right? Here is my config. It ran fine on both p3.16x and p4d.24xl, and GPU memory is not at 800%.

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 3,
    'train_batch_size': 128,
    'model_name': 'distilbert-base-uncased',
    'fp16': True,
}

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p4d.24xlarge',
    instance_count=1,
    role=role,  # IAM role, assumed to be defined earlier in the notebook
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters,
)

train_batch_size = 256 also works fine: it is as fast as 128 on p3.16x and 7% faster than 128 on p4d.24xlarge (both still with fp16).

No, train_batch_size is the batch size per device. See notebooks/train.py at 6b8367544a374daf187854c6ba2274640fa0d7b4 · huggingface/notebooks · GitHub, or the Trainer page of the transformers 4.7.0 documentation.
That means train_batch_size = 256 would have resulted in an effective batch of 256*8 = 2048, which is not possible.
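
To make the per-device arithmetic concrete, here is a small sketch. The helper name effective_batch_size is made up for illustration, and it assumes one process per GPU and no gradient accumulation. Both p3.16xlarge and p4d.24xlarge have 8 GPUs.

```python
def effective_batch_size(per_device_batch_size: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step across all GPUs of one instance.

    Hypothetical helper: assumes one process per GPU and no gradient accumulation.
    """
    return per_device_batch_size * num_gpus


NUM_GPUS = 8  # p3.16xlarge (8x V100) and p4d.24xlarge (8x A100)

for per_device in (32, 128, 256):
    total = effective_batch_size(per_device, NUM_GPUS)
    print(f"per-device {per_device} -> effective {total}")
```

So a value that looks like a cluster-level batch of 256 is really 2048 samples per step once it is applied per device.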

Have you modified train.py? Otherwise fp16: True will not be used, since it is not parsed here: notebooks/train.py at 6b8367544a374daf187854c6ba2274640fa0d7b4 · huggingface/notebooks · GitHub

Well, running train_batch_size = 256 on 1*p3.16xlarge worked for me.

Could you share the exact code/notebook you have used? I would like to reproduce it.

So you didn’t change anything?
If so, fp16 couldn’t work because it is not parsed in train.py. Can you share the logs of your script? There you should be able to see which batch size was used by the Trainer.

@philschmid I didn’t change anything apart from the config above in the estimator (batch size, fp16, instance type). Did you try and see that batch 256 works fine on p3.16x?

Hey,

Found the issue. train.py expected the hyperparameter to be train-batch-size, but we always passed in train_batch_size, so it used the default of 32 instead of the value passed in. I fixed all of the examples, tested 256, and it failed.
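
For anyone hitting the same thing, here is a minimal sketch of how such a name mismatch can go unnoticed. It assumes the script parses its arguments with argparse's parse_known_args, which does not raise on unrecognized options. The argument list is simulated here and is not taken from the actual job logs.

```python
import argparse

parser = argparse.ArgumentParser()
# The script defines the option with hyphens ...
parser.add_argument("--train-batch-size", type=int, default=32)

# ... but the hyperparameter arrives with underscores (simulated command line).
simulated_argv = ["--train_batch_size", "128", "--fp16", "True"]

# parse_known_args returns unrecognized options separately instead of failing,
# so the misspelled option is dropped and the default is kept.
args, unknown = parser.parse_known_args(simulated_argv)

print(args.train_batch_size)  # 32, the default, not 128
print(unknown)                # the options the parser did not recognize
```

Logging the unrecognized options at startup, or using parse_args so that unknown options raise an error, would surface this kind of mismatch immediately.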

If you want to test fp16, you would need to change train.py.
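
A rough sketch of what that change could look like, not the exact edit made to the example script. It assumes the hyperparameter reaches the script as --fp16 True on the command line, and str_to_bool is a hypothetical helper: argparse's type=bool would turn the string "False" into True, so the value is converted explicitly.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Hypothetical helper: convert 'True'/'False'-style strings to a bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument("--fp16", type=str_to_bool, default=False)

# Simulated command line; in a real job the values come from the estimator.
args, _ = parser.parse_known_args(["--fp16", "True"])

# In the training script this value would then be forwarded to the Trainer
# configuration, e.g. TrainingArguments(..., fp16=args.fp16).
training_kwargs = {"fp16": args.fp16}
print(training_kwargs)
```

Checking the training logs for the parsed value is a quick way to confirm that the flag actually reached the Trainer.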
