Inference time increases when using multi-GPU

Ali-consensus · April 1, 2022, 10:15am

Hey folks,
I’m trying to minimize inference time when using XLNet for text classification. I’m using DeepSpeed and its integration with the Hugging Face pipeline. The issue is that inference takes more time on 4 GPUs than on a single GPU. I’ve been researching this for a couple of days but haven’t found anything that addresses it. Here is my prediction code:

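import os
import time

import deepspeed
import torch
from transformers import pipeline

# model, tokenizer and df are assumed to be defined earlier in the script (not shown)

# Rank and world size are read from the environment set by the deepspeed launcher
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-classification', model=model, tokenizer=tokenizer, device=local_rank)

# Wrap the pipeline's model with DeepSpeed inference, using world_size as the model-parallel size
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

# Time classification of the first 1000 sentences
start = time.time()
string = generator(df["sentence"][:1000].tolist())
end = time.time()
print("****************************TIME:*********************************")
print(end - start)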
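
A note on the measurement itself: a single timed call can include one-off startup cost, so a comparison between 1 and 4 GPUs may be fairer with a warm-up run excluded. The following is only a minimal sketch of that idea; `time_call` is a hypothetical helper (not part of DeepSpeed or Transformers), and the warm-up and repeat counts are arbitrary assumptions.

```python
import time


def time_call(fn, *args, warmup=1, repeats=3):
    """Run fn untimed `warmup` times, then return the best of `repeats` timed runs (seconds)."""
    # Warm-up runs are not timed, so one-off initialization does not count.
    for _ in range(warmup):
        fn(*args)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings)
```

In the script above this could be used as `time_call(generator, df["sentence"][:1000].tolist())`, at the cost of running the batch several times.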

Here is the trace for a single GPU:

[2022-04-01 10:08:18,690] [INFO] [launch.py:110:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-04-01 10:08:18,690] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-04-01 10:08:18,690] [INFO] [launch.py:123:main] dist_world_size=1
[2022-04-01 10:08:18,690] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0
[2022-04-01 10:08:33,459] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.1, git-hash=unknown, git-branch=unknown
[2022-04-01 10:08:33,460] [INFO] [engine.py:189:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2022-04-01 10:08:33,462] [INFO] [engine.py:122:__init__] Place model to device: 0
****************************TIME:*********************************
1.3194301128387451
[2022-04-01 10:08:35,714] [INFO] [launch.py:210:main] Process 1762972 exits successfully.

And here is the trace for 4 GPUs:

[2022-04-01 10:10:21,758] [INFO] [launch.py:123:main] dist_world_size=4
[2022-04-01 10:10:21,758] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2022-04-01 10:10:38,733] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.1, git-hash=unknown, git-branch=unknown
[2022-04-01 10:10:38,734] [INFO] [engine.py:189:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2022-04-01 10:10:40,083] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.1, git-hash=unknown, git-branch=unknown
[2022-04-01 10:10:40,083] [INFO] [engine.py:189:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2022-04-01 10:10:40,202] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.1, git-hash=unknown, git-branch=unknown
[2022-04-01 10:10:40,202] [INFO] [engine.py:189:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2022-04-01 10:10:40,215] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.1, git-hash=unknown, git-branch=unknown
[2022-04-01 10:10:40,215] [INFO] [engine.py:189:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2022-04-01 10:10:40,216] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
[2022-04-01 10:10:41,220] [INFO] [engine.py:122:__init__] Place model to device: 2
[2022-04-01 10:10:41,220] [INFO] [engine.py:122:__init__] Place model to device: 0
[2022-04-01 10:10:41,227] [INFO] [engine.py:122:__init__] Place model to device: 1
[2022-04-01 10:10:41,230] [INFO] [engine.py:122:__init__] Place model to device: 3
****************************TIME:*********************************
3.8325045108795166
****************************TIME:*********************************
3.8658478260040283
****************************TIME:*********************************
3.9033865928649902
****************************TIME:*********************************
3.921976327896118
[2022-04-01 10:10:46,797] [INFO] [launch.py:210:main] Process 1763963 exits successfully.
[2022-04-01 10:10:46,797] [INFO] [launch.py:210:main] Process 1763966 exits successfully.
[2022-04-01 10:10:46,797] [INFO] [launch.py:210:main] Process 1763964 exits successfully.
[2022-04-01 10:10:46,797] [INFO] [launch.py:210:main] Process 1763965 exits successfully.

Also, based on the docs, I used this command to run the script:

deepspeed --num_gpus 4 deep_xlnet.py

I tested with a single Tesla T4 and with four Tesla T4s.
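
One observation from the 4-GPU trace: all four processes print a time of roughly 3.8 s, which suggests every rank ran the full list of 1000 sentences (with `mp_size=world_size` the model is split across GPUs, not the data). A minimal sketch of splitting the input by rank instead is shown below. `shard_for_rank` is a hypothetical helper, not a DeepSpeed or Transformers function, and the sketch assumes each process would hold its own full copy of the model, which has not been tested here.

```python
import os


def shard_for_rank(items, rank, world_size):
    """Return the part of `items` that the process with this rank should handle."""
    # Strided split: rank r takes items r, r + world_size, r + 2 * world_size, ...
    return items[rank::world_size]


# Same environment variables the script above already reads.
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

# Placeholder data standing in for df["sentence"][:1000].tolist()
sentences = [f"sentence {i}" for i in range(1000)]

my_sentences = shard_for_rank(sentences, local_rank, world_size)
print(f"rank {local_rank}: {len(my_sentences)} of {len(sentences)} sentences")
```

Each rank would then pass only `my_sentences` to the pipeline, and the per-rank results would still need to be gathered afterwards.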

Thanks in advance

nuxee · November 28, 2023, 5:48am

Have you solved this problem?
