I’m using this SageMaker HF sample (notebooks/sagemaker-notebook.ipynb at master · huggingface/notebooks · GitHub) adjusted with train_batch_size = 128, tested on both 1 p3.16xlarge and 1 p4d.24xlarge. For each instance I’m running one job with fp16=True and one job without the flag. GPU usage is erratic (a sawtooth oscillating between 50% and 750%).
The impact of fp16=True is only a 1% training time reduction, on each instance. Is it because:
- Not specifying fp16 in the trainer already uses fp16? (seems to be false by default though)
- There is a lot of CPU work & I/O in that demo that will not leverage float16?
- Transformers don’t benefit from fp16 training?
Can you please share all of the hyperparameters you used?
train_batch_size = 128 seems pretty high to me to work.
Are you using
The training time reduction should be much higher, more around 20-40%.
When using the example, have you adjusted train.py to accept fp16 as a hyperparameter, or have you defined it directly in the script?
The batch was 32 on 1 V100, so it’s ok to scale 4x on 8x V100, right? My understanding is that train_batch_size is the cluster-level batch, not the per-GPU batch, right? Here is my config. It ran fine on both p3.16x and p4d.24xl, and GPU mem is not at 800%.
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
# (reconstructed from this thread: batch size and fp16 were set here;
# any other entries were not shown in the original post)
hyperparameters = {'train_batch_size': 128,
                   'fp16': True}

huggingface_estimator = HuggingFace(entry_point='train.py',
                                    hyperparameters=hyperparameters)
train_batch_size = 256 also works fine, it is as fast as 128 on p3.16x and 7% faster than 128 on p4d.24xlarge (both still fp16)
train_batch_size is the batch size per device. See notebooks/train.py at 6b8367544a374daf187854c6ba2274640fa0d7b4 · huggingface/notebooks · GitHub, or the documentation here: Trainer — transformers 4.7.0 documentation.
train_batch_size = 256 would have resulted in 256 * 8 = 2048, which is not possible.
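To spell out the per-device vs. effective batch arithmetic above (the GPU count of 8 matches the p3.16xlarge/p4d.24xlarge instances discussed in this thread):

```python
# Effective (cluster-level) batch = per-device batch * number of GPUs.
# Both ml.p3.16xlarge (8x V100) and ml.p4d.24xlarge (8x A100) have 8 GPUs.
per_device_batch = 256
num_gpus = 8
effective_batch = per_device_batch * num_gpus
print(effective_batch)  # 2048
```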
Have you modified train.py? fp16=True will not be used, since it is not parsed here: notebooks/train.py at 6b8367544a374daf187854c6ba2274640fa0d7b4 · huggingface/notebooks · GitHub
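To illustrate why an unparsed hyperparameter is silently dropped: SageMaker passes hyperparameters to the entry point as command-line flags, and a script that collects arguments with parse_known_args simply shunts unrecognized flags into the extras list. This is a minimal sketch of that behavior, not the actual train.py:

```python
import argparse

# SageMaker invokes the entry point roughly as:
#   python train.py --train_batch_size 128 --fp16 True
parser = argparse.ArgumentParser()
parser.add_argument("--train_batch_size", type=int, default=32)
# Note: no --fp16 argument is defined, mirroring the linked train.py.

args, unknown = parser.parse_known_args(
    ["--train_batch_size", "128", "--fp16", "True"]
)
print(args.train_batch_size)  # 128
print(unknown)                # ['--fp16', 'True'] -- fp16 never reaches the Trainer
```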
Well, running train_batch_size = 256 on 1x p3.16xlarge worked for me.
Could you share the exact code/notebook you have used? I would like to reproduce it.
So you didn’t change anything?
fp16 couldn’t work because it is not parsed in train.py. Can you share the logs of your script? There you should be able to see which batch size was used by the Trainer.
@philschmid I didn’t change anything apart from the config above in the estimator (batch size, fp16, instance type). Did you try and see that batch 256 works fine on p3.16x?
Found the issue: train.py expected the hyperparameter to be train-batch-size, and we always passed in train_batch_size, so it used the default of 32 instead of the values passed in. I fixed all of the examples and tested 256, and it failed.
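The hyphen/underscore mismatch can be reproduced with argparse directly; this sketch shows why the default of 32 silently won:

```python
import argparse

# The script defined the flag with hyphens...
parser = argparse.ArgumentParser()
parser.add_argument("--train-batch-size", type=int, default=32)

# ...but the estimator passed underscores, so argparse treats the flag
# as unknown and falls back to the default.
args, unknown = parser.parse_known_args(["--train_batch_size", "128"])
print(args.train_batch_size)  # 32 (argparse stores --train-batch-size as train_batch_size)
print(unknown)                # ['--train_batch_size', '128']
```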
If you want to test fp16 you would need to change train.py.
So how would I change the code to enable fp16 computation in this case?
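One way to do it (a sketch, not the official fix; the argument names mirror this thread, and the TrainingArguments call is shown as a comment since it needs transformers and a real training run): parse an --fp16 flag in train.py and forward it to TrainingArguments. Note that SageMaker serializes hyperparameter values to strings, so the flag arrives as the string "True":

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_batch_size", type=int, default=32)
    # SageMaker passes hyperparameters as strings, so convert "True"/"False".
    parser.add_argument("--fp16", type=lambda s: s.lower() == "true", default=False)
    args, _ = parser.parse_known_args(argv)
    return args

args = parse_args(["--train_batch_size", "128", "--fp16", "True"])
print(args.fp16)  # True

# Then in train.py, forward the parsed values to the Trainer (sketch):
# training_args = TrainingArguments(
#     output_dir="/opt/ml/model",
#     per_device_train_batch_size=args.train_batch_size,
#     fp16=args.fp16,
# )
```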