It seems the evaluation originally ran on a single node and was later converted to run on multiple nodes (see the issue). That does require complex operations to gather the output tensors and to move data between device and host. I was wondering why evaluating on multiple nodes is better than evaluating on one node: is it purely for efficiency, or might there be another reason?