I’m using this SageMaker HF sample (notebooks/sagemaker-notebook.ipynb from huggingface/notebooks on GitHub), adjusted with train_batch_size = 128, and tested on both a single p3.16xlarge and a single p4d.24xlarge. For each instance I run one job with fp16=True and one without the flag. GPU utilization is erratic (a sawtooth oscillating between 50% and 750%).
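For context, this is roughly how I’m launching the jobs (a sketch, not the exact notebook code: the fp16 hyperparameter is my addition, the version pins and S3 paths are placeholders, and I’m assuming train.py forwards the hyperparameters into TrainingArguments):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# fp16 is my addition to the sample's hyperparameters; the baseline
# job simply omits the key. train.py is assumed to parse these and
# pass them into TrainingArguments.
hyperparameters = {
    "epochs": 1,
    "train_batch_size": 128,
    "model_name": "distilbert-base-uncased",
    "fp16": True,
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p4d.24xlarge",  # second run uses ml.p3.16xlarge
    instance_count=1,
    role=role,
    transformers_version="4.6.1",  # pins should match the sample notebook
    pytorch_version="1.7.1",
    py_version="py36",
    hyperparameters=hyperparameters,
)

# Placeholder S3 URIs standing in for the sample's preprocessed dataset
training_input_path = "s3://<bucket>/samples/datasets/imdb/train"
test_input_path = "s3://<bucket>/samples/datasets/imdb/test"
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
```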
The impact of fp16=True is only about a 1% training time reduction on each instance. Is it because:
- Not specifying fp16 means the Trainer already uses fp16? (it seems to default to False, though; see the check after this list)
- There is a lot of CPU work & I/O in that demo that will not leverage float16?
- Transformer models don’t benefit from fp16 training?
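Regarding the first bullet, here is a minimal check I ran (nothing SageMaker-specific; the commented logging line assumes a `trainer` object as in the sample’s train.py):

```python
from transformers import TrainingArguments

# fp16 defaults to False, so the job without the flag should be fp32
default_args = TrainingArguments(output_dir="/tmp/out")
print(default_args.fp16)  # -> False

# Inside train.py, logging what the Trainer actually received would
# rule out the flag being lost between the estimator and the script:
# print(f"fp16 active: {trainer.args.fp16}")
```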