Hi!
As you mentioned in this post:
Accelerator .prepare() replaces custom DataLoader Sampler - Accelerate - Hugging Face Forums
when we use a custom sampler, it is kept and used on the downstream processes.
But today it works like this:
the sampler generates batches, and the batches are assigned to the different processes based on their index.
With a weighted sampler, the same data rows are likely to appear in several batches, which means the same data can end up on multiple processes.
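Here is a minimal, standalone sketch of what I mean (no Accelerate involved; the toy weights are made up, and the 2-process split at the end just mimics the batch-index assignment described above):

```python
import torch
from torch.utils.data import WeightedRandomSampler

# 4-row toy dataset with very skewed weights: row 0 is drawn most of the time.
weights = [0.9, 0.05, 0.03, 0.02]
sampler = WeightedRandomSampler(weights, num_samples=8, replacement=True)

indices = list(sampler)
batch_size = 2
batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]

# With 2 processes and batch-level sharding by index, process 0 would get
# batches 0 and 2, process 1 would get batches 1 and 3.
proc0 = batches[0] + batches[2]
proc1 = batches[1] + batches[3]
print("process 0 indices:", proc0)
print("process 1 indices:", proc1)
print("rows seen by both processes:", set(proc0) & set(proc1))  # usually non-empty
```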
I would like to make sure each process uses independent data.
Is it currently possible?
If yes, how?
If no, how should I implement it?
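One idea I had, as a rough sketch only (the dataset, weights, and seed are placeholders, and I am not sure that keeping the dataloader out of prepare() is the intended approach): give each process its own seeded generator for the WeightedRandomSampler and move the batches to the device manually:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder dataset and weights, just for illustration.
dataset = TensorDataset(torch.randn(1000, 16))
weights = torch.rand(len(dataset))

# Give each process its own RNG so the weighted draws are independent.
generator = torch.Generator()
generator.manual_seed(42 + accelerator.process_index)

sampler = WeightedRandomSampler(
    weights, num_samples=len(dataset), replacement=True, generator=generator
)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

# The dataloader is deliberately NOT passed to accelerator.prepare(),
# so the custom sampler is not replaced; batches are moved to the
# right device manually instead.
for (batch,) in dataloader:
    batch = batch.to(accelerator.device)
    ...
```

Would something along those lines be reasonable, or does it break things that prepare() normally handles?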
I also created a script to reproduce the issue; it is here:
Dataloader WeightedRandomSampler + Distributed Training · Issue #2865 · huggingface/accelerate (github.com)
Thanks for your help and feedback