Is it normal for DistributedDataParallel to use more memory than a single GPU?

Hello, I am new here.
I am trying to fine-tune an MBart model (mbart-large-cc25). My device: one PC (Ubuntu) with 2 GPUs, 10 GB memory each.
When I fine-tune on a single GPU, loading the model onto the GPU costs only 3 GB; once training starts, usage increases to 8 GB, so I can fine-tune with a small batch size.
I want to use DistributedDataParallel so I can train with a larger batch size.
But when I load the model onto the 2 GPUs, each one already uses 7 GB before training has even started.
Is this normal?

Here is my initialization code:

import os
import argparse

import torch
from torch.nn.parallel import DistributedDataParallel
from transformers import MBartForConditionalGeneration, MBartTokenizer

class SummarizationModule:
    def __init__(self, local_rank) -> None:
        self.mBartModel = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
        self.mBartTokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25", src_lang="en_XX", tgt_lang="zh_CN")
        self.mLocal_rank = local_rank
        print('Rank : [', self.mLocal_rank, '] is ready.')
        os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '8787'
        # world_size must match --nproc_per_node (2 processes, one per GPU)
        torch.distributed.init_process_group(backend="nccl", init_method="env://", world_size=2, rank=self.mLocal_rank)
        # pin this process to its own GPU so .cuda() targets the right device
        torch.cuda.set_device(self.mLocal_rank)
        self.mBartModel = self.mBartModel.cuda()
        self.mBartModel = DistributedDataParallel(self.mBartModel, find_unused_parameters=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
    args = parser.parse_args()
    mModel = SummarizationModule(args.local_rank)

Launched with:

CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2

There is some memory overhead when using DistributedDataParallel: in addition to the model parameters, DDP allocates flattened bucket buffers that it uses for gradient all-reduce, so it is normal to see noticeably higher GPU usage before training starts, yes.
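As a rough back-of-the-envelope sketch of why the pre-training footprint grows (assuming roughly 610M parameters for mbart-large-cc25 in fp32, and DDP bucket buffers totalling about the same size as the gradients; the exact parameter count and bucket sizing may differ, and CUDA context/kernel memory adds more on top):

    # Rough estimate of extra memory DDP holds before training even starts.
    # Assumption: ~610M parameters (mbart-large-cc25), stored as fp32.
    num_params = 610_000_000
    bytes_per_param = 4  # fp32

    params_gb = num_params * bytes_per_param / 1024**3
    # DDP's flat gradient-reduction buckets are roughly the same total
    # size as the gradients, i.e. about the size of the parameters.
    ddp_buckets_gb = params_gb

    print(f"parameters:            {params_gb:.2f} GB")
    print(f"DDP buckets (approx):  {ddp_buckets_gb:.2f} GB")
    print(f"total before training: {params_gb + ddp_buckets_gb:.2f} GB")

That alone is around 4.5 GB per GPU before any activations or optimizer state exist, which is consistent with seeing much more than the bare 3 GB model copy.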


Thank you for your help.