Azure Machine Learning: Training isn't significantly faster with multiple GPUs

Hi Guys!

I’m using Azure Machine Learning to train a text classification model, but training on 10 GPUs isn’t as fast as I would expect it to be.

First, I did a benchmark with 1 V100. It took 3 hours to run 5 epochs.

Second, I did a run with 10 V100s. It took 2 hours to run 5 epochs.

Am I doing something wrong? Or is this normal?

This is my code:

```python
job = command(
    inputs=inputs,
    compute=gpu_compute_target,
    environment=curated_env,
    code="./folder",
    command="python train.py --inputs ${{inputs.data}}",
    experiment_name="experiment1",
    instance_count=10,
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
```
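
For reference, here is a hedged sketch of the same job using the typed `PyTorchDistribution` helper from the v2 SDK (the import and the compute name are assumptions on my side). The key point is that `process_count_per_instance` should equal the number of GPUs on each node, so 1 is only correct if every node carries a single V100:

```python
# Sketch of the same job with the typed distribution class (assumed to be
# available as azure.ai.ml.PyTorchDistribution in the v2 SDK).
from azure.ai.ml import PyTorchDistribution, command

job = command(
    inputs=inputs,                       # same inputs dict as above
    compute=gpu_compute_target,          # assumed GPU cluster name
    environment=curated_env,
    code="./folder",
    command="python train.py --inputs ${{inputs.data}}",
    experiment_name="experiment1",
    instance_count=10,                   # number of nodes in the job
    distribution=PyTorchDistribution(
        # One worker process per GPU on a node; raise this if the
        # nodes have more than one V100 each.
        process_count_per_instance=1,
    ),
)
```

With 10 single-GPU nodes, every gradient all-reduce crosses the network, so scaling is rarely linear; fewer nodes with more GPUs each, or a larger per-device batch size, usually scale better.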

My trainer code is pretty basic; I’m using the code from the Hugging Face course chapter "Fine-tuning a model with the Trainer API". A rough sketch of it is below.
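
Based on that course page, a minimal version of the script looks roughly like this; the checkpoint name, label count, and dataset layout are placeholders rather than my exact setup:

```python
# Minimal sketch of a Trainer-based train.py for text classification.
# Model name, num_labels, and the dataset format are assumptions.
import argparse

from datasets import load_from_disk
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--inputs", type=str, help="Path to the mounted dataset")
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Assumes the input was written with datasets.save_to_disk and has a
    # "text" column plus integer labels.
    dataset = load_from_disk(args.inputs)
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True),
        batched=True,
    )

    training_args = TrainingArguments(
        output_dir="./outputs",
        num_train_epochs=5,
        # This is the batch size *per GPU*; with 10 processes the effective
        # batch size is 10x larger and the steps per epoch drop accordingly.
        per_device_train_batch_size=32,
        # Can reduce DDP overhead when the model has no unused parameters.
        ddp_find_unused_parameters=False,
    )

    # The Trainer picks up RANK/LOCAL_RANK/WORLD_SIZE from the environment
    # set by the PyTorch launcher and wraps the model in DDP itself.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset.get("validation"),
        tokenizer=tokenizer,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```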

Thank you!


I have the same issue. The data seems to be distributed across multiple nodes, but the total training time does not decrease.
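
One thing worth checking (just a suggestion; `log_distributed_setup` is an illustrative name, not an existing helper): print the distributed environment at the top of train.py so each process confirms it really joined a multi-GPU world:

```python
# Quick sanity check for the top of train.py: each worker prints the
# distributed environment it sees.
import os

import torch


def log_distributed_setup():
    # The PyTorch launcher is expected to set these variables before starting
    # each worker; if WORLD_SIZE is missing or 1, DDP never kicked in and
    # every process is training on the full dataset by itself.
    print(
        f"RANK={os.environ.get('RANK')} "
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} "
        f"WORLD_SIZE={os.environ.get('WORLD_SIZE')} "
        f"visible_gpus={torch.cuda.device_count()}"
    )


log_distributed_setup()
```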