Azure Machine Learning: Training isn't significantly faster with multiple GPUs

Hi Guys!

I’m using Azure Machine Learning to train a text classification model, but training on 10 GPUs isn’t as fast as I would expect it to be.

First, I did a benchmark with 1 V100. It took 3 hours to run 5 epochs.

Second, I did a run with 10 V100s. It took 2 hours to run 5 epochs.

Am I doing something wrong? Or is this normal?

This is my code:

```python
job = command(
    inputs=inputs,
    compute=gpu_compute_target,
    environment=curated_env,
    code="./folder",
    command="python train.py --inputs ${{inputs.data}}",
    experiment_name="experiment1",
    instance_count=10,
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
```
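
For reference, here is a hedged sketch of the same job using the typed `PyTorchDistribution` helper from the v2 SDK (the import and the compute name are assumptions on my side). The key point is that `process_count_per_instance` should equal the number of GPUs on each node, so 1 is only correct if every node carries a single V100:

```python
# Sketch of the same job with the typed distribution class (assumed to be
# available as azure.ai.ml.PyTorchDistribution in the v2 SDK).
from azure.ai.ml import PyTorchDistribution, command

job = command(
    inputs=inputs,                       # same inputs dict as above
    compute=gpu_compute_target,          # assumed GPU cluster name
    environment=curated_env,
    code="./folder",
    command="python train.py --inputs ${{inputs.data}}",
    experiment_name="experiment1",
    instance_count=10,                   # number of nodes in the job
    distribution=PyTorchDistribution(
        # One worker process per GPU on a node; raise this if the
        # nodes have more than one V100 each.
        process_count_per_instance=1,
    ),
)
```

With 10 single-GPU nodes, every gradient all-reduce crosses the network, so scaling is rarely linear; fewer nodes with more GPUs each, or a larger per-device batch size, usually scale better.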

My trainer code is pretty basic; I’m using the code from the Hugging Face course chapter "Fine-tuning a model with the Trainer API". A rough sketch of it is below.
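
Based on that course page, a minimal version of the script looks roughly like this; the checkpoint name, label count, and dataset layout are placeholders rather than my exact setup:

```python
# Minimal sketch of a Trainer-based train.py for text classification.
# Model name, num_labels, and the dataset format are assumptions.
import argparse

from datasets import load_from_disk
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--inputs", type=str, help="Path to the mounted dataset")
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Assumes the input was written with datasets.save_to_disk and has a
    # "text" column plus integer labels.
    dataset = load_from_disk(args.inputs)
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True),
        batched=True,
    )

    training_args = TrainingArguments(
        output_dir="./outputs",
        num_train_epochs=5,
        # This is the batch size *per GPU*; with 10 processes the effective
        # batch size is 10x larger and the steps per epoch drop accordingly.
        per_device_train_batch_size=32,
        # Can reduce DDP overhead when the model has no unused parameters.
        ddp_find_unused_parameters=False,
    )

    # The Trainer picks up RANK/LOCAL_RANK/WORLD_SIZE from the environment
    # set by the PyTorch launcher and wraps the model in DDP itself.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset.get("validation"),
        tokenizer=tokenizer,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```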

Thank you!


I have the same issue. The data seems to be distributed across multiple nodes, but the total training time does not decrease.
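
One thing worth checking (just a suggestion; `log_distributed_setup` is an illustrative name, not an existing helper): print the distributed environment at the top of train.py so each process confirms it really joined a multi-GPU world:

```python
# Quick sanity check for the top of train.py: each worker prints the
# distributed environment it sees.
import os

import torch


def log_distributed_setup():
    # The PyTorch launcher is expected to set these variables before starting
    # each worker; if WORLD_SIZE is missing or 1, DDP never kicked in and
    # every process is training on the full dataset by itself.
    print(
        f"RANK={os.environ.get('RANK')} "
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} "
        f"WORLD_SIZE={os.environ.get('WORLD_SIZE')} "
        f"visible_gpus={torch.cuda.device_count()}"
    )


log_distributed_setup()
```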