Hi Guys!
I’m using Azure Machine Learning Services to train a text classification model, but training on 10 GPUs isn’t as fast as I would expect.
First, I ran a benchmark on 1 V100. It took 3 hours to run 5 epochs.
Second, I ran the same training on 10 V100s. It took 2 hours to run 5 epochs.
Am I doing something wrong? Or is this normal?
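To put a number on it, here is a quick sketch of the speedup and scaling efficiency from the two runs above. The function name `scaling_efficiency` is just something made up for illustration, and it assumes both runs did the same amount of work (5 epochs over the same data).

```python
def scaling_efficiency(t_single_h, t_multi_h, n_gpus):
    """Return (speedup, efficiency) of a multi-GPU run versus a single-GPU baseline."""
    speedup = t_single_h / t_multi_h   # how many times faster the multi-GPU run was
    efficiency = speedup / n_gpus      # fraction of ideal linear scaling (1.0 = perfect)
    return speedup, efficiency


# Timings from the two runs: 3 hours on 1 V100, 2 hours on 10 V100s.
speedup, efficiency = scaling_efficiency(3.0, 2.0, 10)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

So 10 GPUs give only about a 1.5x speedup, which is roughly 15% of what perfect linear scaling would give.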
This is my code:
Azure
from azure.ai.ml import command

job = command(
    inputs=inputs,
    compute=gpu_compute_taget,
    environment=curated_env,
    code="./folder",
    command="python train.py --inputs ${{inputs.data}}",
    experiment_name="experiment1",
    instance_count=10,  # number of nodes
    distribution={
        "type": "Pytorch",
        "process_count_per_instance": 1,  # one training process per node
    },
)
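To check that all 10 processes actually start and see each other, something like the following could go at the top of train.py. This is only a minimal sketch: it assumes the PyTorch distribution sets the usual `RANK`, `LOCAL_RANK` and `WORLD_SIZE` environment variables for each process, and `describe_process` is a hypothetical helper name, not part of any library.

```python
import os


def describe_process(env=None):
    """Summarize the distributed-training environment variables for this process."""
    env = os.environ if env is None else env
    # Fall back to single-process values when the variables are not set.
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


info = describe_process()
print(f"rank {info['rank']} (local {info['local_rank']}) of {info['world_size']} processes")
```

With the job definition above, each of the 10 nodes should report a world size of 10. If they report 1 instead, the nodes are probably each training on their own rather than together.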
My trainer code is pretty basic. I’m using the code from: Fine-tuning a model with the Trainer API - Hugging Face Course
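As far as I understand, the Trainer keeps the per-device batch size fixed, so with more processes the effective batch size grows and the number of optimizer steps per epoch shrinks. Here is a rough sketch of that arithmetic. The dataset size of 100,000 examples is a made-up number for illustration, 8 is the Trainer's default `per_device_train_batch_size`, and `steps_per_epoch` is a hypothetical helper, not a Trainer function.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, world_size, grad_accum_steps=1):
    """Return (effective_batch_size, optimizer_steps_per_epoch) for data-parallel training."""
    # Every process handles its own batch, so the global batch grows with the process count.
    effective_batch = per_device_batch_size * world_size * grad_accum_steps
    return effective_batch, math.ceil(num_examples / effective_batch)


# Hypothetical dataset of 100,000 examples, per-device batch size of 8.
for gpus in (1, 10):
    batch, steps = steps_per_epoch(100_000, 8, gpus)
    print(f"{gpus:>2} GPU(s): effective batch {batch}, {steps} steps per epoch")
```

If the step count per epoch drops by 10x but the wall-clock time only drops by 1.5x, each step must be much slower on 10 nodes, which might point at communication between the nodes or at data loading rather than at the GPUs themselves.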
Thank you!