Inference time increases when using multiple GPUs

Hey folks,
I’m trying to minimize inference time when using XLNet for text classification. I’m using DeepSpeed and its integration with the Hugging Face pipeline. The problem is that inference takes longer on 4 GPUs than on a single GPU. I’ve been researching this for a couple of days but haven’t found anything that addresses it. Here is my prediction code:

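import os
import time

import deepspeed
import torch
from transformers import pipeline

# model, tokenizer and df are defined earlier in the script (not shown).
# LOCAL_RANK and WORLD_SIZE are set by the deepspeed launcher.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-classification', model=model, tokenizer=tokenizer, device=local_rank)

# Wrap the pipeline's model in the DeepSpeed inference engine.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

# Time the classification of the first 1000 sentences.
start = time.time()
predictions = generator(df["sentence"][:1000].tolist())
end = time.time()
print("****************************TIME:*********************************")
print(end - start)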
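
To rule out one-off startup costs in the measurement, the timing could also be done with a warm-up call and a few repeats. This is only a minimal sketch: `time_call` is a hypothetical helper name I made up for illustration (it is not part of DeepSpeed or transformers), and the usage line assumes the `generator` and `df` objects from the snippet above.

```python
import statistics
import time


def time_call(fn, *args, warmup=1, repeats=3):
    """Return the median wall-clock seconds of fn(*args) over `repeats` runs.

    Hypothetical helper for illustration only.
    """
    # Warm-up runs absorb one-off costs (for example lazy initialization)
    # so that they do not end up in the measurement.
    for _ in range(warmup):
        fn(*args)

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Usage with the pipeline above (assumes `generator` and `df` exist):
# elapsed = time_call(generator, df["sentence"][:1000].tolist())
```

The median is used rather than the mean so that a single slow run does not dominate the reported number.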

Here is the trace for a single GPU:

[2022-04-01 10:08:18,690] [INFO] [launch.py:110:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-04-01 10:08:18,690] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-04-01 10:08:18,690] [INFO] [launch.py:123:main] dist_world_size=1
[2022-04-01 10:08:18,690] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0
[2022-04-01 10:08:33,459] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.1, git-hash=unknown, git-branch=unknown
[2022-04-01 10:08:33,460] [INFO] [engine.py:189:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2022-04-01 10:08:33,462] [INFO] [engine.py:122:__init__] Place model to device: 0
****************************TIME:*********************************
1.3194301128387451
[2022-04-01 10:08:35,714] [INFO] [launch.py:210:main] Process 1762972 exits successfully.

And here is the trace for 4 GPUs:

[2022-04-01 10:10:21,758] [INFO] [launch.py:123:main] dist_world_size=4
[2022-04-01 10:10:21,758] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2022-04-01 10:10:38,733] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.1, git-hash=unknown, git-branch=unknown
[2022-04-01 10:10:38,734] [INFO] [engine.py:189:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2022-04-01 10:10:40,083] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.1, git-hash=unknown, git-branch=unknown
[2022-04-01 10:10:40,083] [INFO] [engine.py:189:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2022-04-01 10:10:40,202] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.1, git-hash=unknown, git-branch=unknown
[2022-04-01 10:10:40,202] [INFO] [engine.py:189:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2022-04-01 10:10:40,215] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.1, git-hash=unknown, git-branch=unknown
[2022-04-01 10:10:40,215] [INFO] [engine.py:189:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2022-04-01 10:10:40,216] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
[2022-04-01 10:10:41,220] [INFO] [engine.py:122:__init__] Place model to device: 2
[2022-04-01 10:10:41,220] [INFO] [engine.py:122:__init__] Place model to device: 0
[2022-04-01 10:10:41,227] [INFO] [engine.py:122:__init__] Place model to device: 1
[2022-04-01 10:10:41,230] [INFO] [engine.py:122:__init__] Place model to device: 3
****************************TIME:*********************************
3.8325045108795166
****************************TIME:*********************************
3.8658478260040283
****************************TIME:*********************************
3.9033865928649902
****************************TIME:*********************************
3.921976327896118
[2022-04-01 10:10:46,797] [INFO] [launch.py:210:main] Process 1763963 exits successfully.
[2022-04-01 10:10:46,797] [INFO] [launch.py:210:main] Process 1763966 exits successfully.
[2022-04-01 10:10:46,797] [INFO] [launch.py:210:main] Process 1763964 exits successfully.
[2022-04-01 10:10:46,797] [INFO] [launch.py:210:main] Process 1763965 exits successfully.

Also, based on the docs, I used this command to run the script:

deepspeed --num_gpus 4 deep_xlnet.py

I tested with a single Tesla T4 and with four Tesla T4s.

Thanks in advance :)