Streaming dataset freezes with multi-gpu

Hi @lvwerra
I am running the codeparrot example provided by Hugging Face, using this command:

accelerate launch scripts/ \
--model_ckpt codeparrot/codeparrot-small \
--dataset_name_train ./data/codeparrot-clean-train \
--dataset_name_valid ./data/codeparrot-clean-valid \
--train_batch_size 12 \
--valid_batch_size 12 \
--learning_rate 5e-4 \
--num_warmup_steps 2000 \
--gradient_accumulation 1 \
--gradient_checkpointing False \
--max_train_steps 150000 \
--save_checkpoint_steps 15000

The code freezes in the multi-GPU setting. I see similar reports in the datasets library: "datasets freezes with streaming mode in multiple-gpu" (Issue #5123, huggingface/datasets on GitHub).

Is any specific setting needed to run this script? Have you encountered this freezing with this script?

Thanks for any help on this.

Hi @rkarimi

Could this be related to this issue:

Can you try to increase the chunk_size?

cc @loubnabnl

Can you make sure you're using PyTorch 1.11? It seems that ShuffleIterDataPipe changed in more recent versions, and this shuffling makes the dataloader get stuck.
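
A quick way to confirm which PyTorch version the launched processes actually see is sketched below. The helper names (major_minor, is_torch_1_11) are made up for illustration and are not part of any library; the only PyTorch attribute used is torch.__version__.

```python
import re


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.11.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_torch_1_11(version: str) -> bool:
    """Return True if the version string belongs to the 1.11 release line."""
    return major_minor(version) == (1, 11)


if __name__ == "__main__":
    try:
        import torch  # only needed for the live check
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(torch.__version__, "-> is 1.11:", is_torch_1_11(torch.__version__))
```

Run it in the same environment you use for accelerate launch, since a different virtualenv or conda env may have a different PyTorch build.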

And does it work for you in non-streaming mode? I think the issue should persist even without it.
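
To compare the two modes in isolation, something like the sketch below might help. It assumes the data is loaded through datasets.load_dataset with split="train"; the helpers dataset_kwargs and load_train_split are hypothetical names for this example, not functions from the script.

```python
def dataset_kwargs(streaming: bool) -> dict:
    """Keyword arguments for load_dataset; only the streaming flag varies."""
    return {"split": "train", "streaming": streaming}


def load_train_split(path: str, streaming: bool):
    """Load a dataset either fully (streaming=False) or lazily (streaming=True)."""
    # Imported lazily so the helper above can be used without datasets installed.
    from datasets import load_dataset

    return load_dataset(path, **dataset_kwargs(streaming))


if __name__ == "__main__":
    # Path taken from the command in the original post; adjust as needed.
    path = "./data/codeparrot-clean-valid"
    for streaming in (False, True):
        ds = load_train_split(path, streaming)
        first = next(iter(ds))  # works for both Dataset and IterableDataset
        print("streaming =", streaming, "first example keys:", list(first.keys()))
```

If both modes iterate fine in a single process like this, the hang is more likely tied to the multi-process dataloader setup than to the dataset itself.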