Thanks @muellerzr. Will do. I was finally able to get a “good node” that runs without this error by adding each failing node to the SLURM --exclude
list; it worked after about 6 tries.
I’ll open the issue to see if we can figure out what distinguishes a “good node” from a “bad node”.