Thank you very much for the
We are currently running into a problem and were wondering whether it is a known issue.
We want to run a training job with deepspeed on 4 nodes with 4 GPUs each. However, our logs show that 4 processes consider themselves to be both a main_process and a local_main_process. We would have expected to see 1 main_process and 4 local_main_process.
Is what we expected wrong? Do you see a mistake we could have made?
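To make our expectation concrete, here is a minimal sketch of how we counted. It assumes the usual convention that the main process is the one with global rank 0 and that a local main process is any process with local rank 0 on its node. The helper name count_main_processes is hypothetical and not part of any library; it only simulates the rank layout and does not launch anything.

```python
def count_main_processes(num_nodes, gpus_per_node):
    """Count expected main and local main processes for a given layout.

    Assumption: global rank = node index * gpus_per_node + local rank,
    main process = global rank 0, local main process = local rank 0.
    """
    main = 0
    local_main = 0
    for node in range(num_nodes):
        for local_rank in range(gpus_per_node):
            rank = node * gpus_per_node + local_rank
            if rank == 0:
                main += 1
            if local_rank == 0:
                local_main += 1
    return main, local_main


# Our setup: 4 nodes with 4 GPUs each.
main, local_main = count_main_processes(num_nodes=4, gpus_per_node=4)
print(f"main processes: {main}, local main processes: {local_main}")
```

If each node instead numbered its global ranks from 0 on its own, every node would have one process with both flags set, which would give the 4 we observe. We are not sure whether that is what is happening in our case.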
Thanks a lot in advance!