-
I misinterpreted the above nccl-tests results. It was not testing cross-node communications, only cross-GPU communications within a node.
-
I realized that I needed to use
sudo
to get the Infiniband benchmark tests to work, such asib_write_bw
. This was an issue with theulimit -l
, See here and here
With that fixed, I can successfully train cross-machine.