Dataloader fetches slowly when using accelerate for distributed training

Hi, I am training on multiple GPUs with accelerate. The problem is that when I keep the batch size on each GPU fixed, the training time per step increases sharply as more GPUs are added. I broke down the step time and traced the slowdown to fetching a batch from the dataloader: it takes 0.01 s/iter, 0.09 s/iter, and 0.2 s/iter with 1, 2, and 4 GPUs, respectively. This hurts efficiency: with accumulation_step set to 8, around 1.6 s (8 × 0.2 s) is spent on fetching data in the 4-GPU case. The same machine works fine when I use the transformers Trainer for distributed training, so I think the hardware is fine and I have probably made a mistake somewhere in my setup. Does anyone have an idea how to solve this? Looking forward to your reply!
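
For anyone who wants to reproduce the measurement, here is a minimal sketch of timing only the batch fetch, separately from the forward and backward pass. The helper names `time_fetches` and `fetch_time_per_update` are hypothetical and not part of accelerate or PyTorch. The sketch assumes the loader is any Python iterable, which includes a dataloader returned by `accelerator.prepare`.

```python
import time
from typing import Iterable, List


def time_fetches(loader: Iterable, num_steps: int) -> List[float]:
    """Return the wall-clock seconds spent on each of the first `num_steps` fetches.

    Hypothetical helper: only the `next()` call is timed, so model compute
    is excluded. The first fetch may also include iterator start-up cost.
    """
    durations: List[float] = []
    iterator = iter(loader)
    for _ in range(num_steps):
        start = time.perf_counter()
        try:
            next(iterator)  # the batch fetch being measured
        except StopIteration:
            break  # loader exhausted before num_steps
        durations.append(time.perf_counter() - start)
    return durations


def fetch_time_per_update(seconds_per_fetch: float, accumulation_steps: int) -> float:
    """Fetch time accumulated over one optimizer update (one fetch per micro-step)."""
    return seconds_per_fetch * accumulation_steps


if __name__ == "__main__":
    def slow_loader():
        # Stand-in for a real dataloader: each "batch" takes about 10 ms to produce.
        for i in range(5):
            time.sleep(0.01)
            yield i

    durations = time_fetches(slow_loader(), num_steps=5)
    mean_fetch = sum(durations) / len(durations)
    print(f"mean fetch time: {mean_fetch:.4f} s/iter")
    # With the numbers from the post: 0.2 s/iter and 8 accumulation steps.
    print(f"fetch time per update: {fetch_time_per_update(0.2, 8):.1f} s")
```

Running the same timing loop with 1, 2, and 4 processes and comparing the mean fetch time should show whether the slowdown really comes from the dataloader rather than from gradient synchronization.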