Pipeline-parallel VRAM consumption

I'm using transformers and PEFT to fine-tune InternLM-7B with QLoRA (a minimal sketch of the setup follows the list below). I run the same code with batch size = 2 in three configurations:
1. With a single GPU, the VRAM consumed is about 20 GB.
2. With 2 GPUs under pipeline parallelism, the total consumption rises to 28 GB,
3. and with 4 GPUs it reaches 40 GB in total.
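For reference, this is roughly the pattern I mean, assuming the common transformers/PEFT QLoRA recipe where `device_map="auto"` splits the layers across all visible GPUs (the LoRA hyperparameters and target modules here are placeholders, not my exact values):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# device_map="auto" shards the layers across all visible GPUs
# (naive pipeline / model parallelism)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# Placeholder LoRA settings; target_modules assume a LLaMA-like layout
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```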
I have observed the same phenomenon with Falcon-40B. I am confused by this increase in VRAM usage:
1. Is there some redundancy in the optimizer states or in the pipeline-parallel mechanism?
2. How can I alleviate this increase? It makes the usable batch size much smaller.