Multi Node GPU: `connecting to address with family 7299 is neither AF_INET(2) nor AF_INET6(10)`

  1. I misinterpreted the nccl-tests results above. They were not testing cross-node communication, only cross-GPU communication within a single node.

  2. I realized that I needed to use `sudo` to get the InfiniBand benchmark tests, such as `ib_write_bw`, to work. This was an issue with `ulimit -l` (the limit on locked memory); see here and here.
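For anyone hitting the same thing, here is a minimal sketch of how the locked-memory limit could be checked from Python before launching a job. It only uses the standard-library `resource` module. The helper names `memlock_limit` and `describe` are made up for illustration, and the sketch does not decide what counts as "too low", since that depends on your setup.

```python
import resource


def memlock_limit():
    """Return (soft, hard) RLIMIT_MEMLOCK in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def to_optional(value):
        # RLIM_INFINITY is the sentinel the OS uses for "no limit".
        return None if value == resource.RLIM_INFINITY else value

    return to_optional(soft), to_optional(hard)


def describe(limit):
    """Format a limit in KiB, the unit bash's `ulimit -l` reports."""
    return "unlimited" if limit is None else f"{limit // 1024} KiB"


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print(f"locked-memory soft limit: {describe(soft)}")
    print(f"locked-memory hard limit: {describe(hard)}")
```

If the soft limit printed for your normal user is a small finite value while the same check under `sudo` reports a much larger one or "unlimited", that would be consistent with the behaviour I saw, where the InfiniBand tools only worked as root.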

With that fixed, I can successfully train cross-machine.
