Loading a large dataset occupies ~2GB on each GPU

Hi,

We are calling dataset = load_dataset(dataset_name) in a DDP setting, meaning that every GPU has its own process and each process executes this call independently.
We see that after loading the dataset, every GPU has roughly 2GB of allocated memory. Oddly, this allocation only seems to appear near the end of our training.
During load_dataset we also see high GPU utilization while each GPU's process takes its turn. Why is that? I thought this function didn't touch the GPU at all.
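For reference, here is a simplified sketch of what each process runs (launched with torchrun; dataset_name stands in for our actual dataset):

```python
import os

import torch
import torch.distributed as dist
from datasets import load_dataset


def main():
    # torchrun starts one process per GPU; every rank runs this function.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every rank calls load_dataset independently, one after the other.
    dataset = load_dataset("dataset_name")  # placeholder name

    # ... build the model, wrap it in DDP, and run the training loop ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```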

Any insights would be greatly appreciated.