OK, I fixed this issue.
The random seed was set to the same value on every GPU.
This meant the dropout masks were identical on each device, so the dropout noise never averaged out across devices, which led to large and funky gradients.
I think this is a gotcha for people. Maybe you should handle this internally?
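
For anyone who lands here with the same symptom, below is a minimal sketch of the idea, not the actual fix from my training code. It assumes each process knows its own rank and offsets a shared base seed by it. `per_rank_seed` and `dropout_mask` are hypothetical names made up for illustration, and the toy mask generator uses Python's `random` module as a stand-in for whatever RNG your framework uses for dropout.

```python
import random


def per_rank_seed(base_seed, rank):
    """Hypothetical helper: derive a distinct seed for each device/process."""
    return base_seed + rank


def dropout_mask(seed, size, p=0.5):
    """Toy stand-in for a framework's dropout mask: 1 = keep, 0 = drop."""
    rng = random.Random(seed)
    return [0 if rng.random() < p else 1 for _ in range(size)]


base_seed = 1234   # assumed shared base seed
world_size = 4     # assumed number of GPUs/processes
mask_size = 64

# The bug: every rank uses the same seed, so every rank draws the same mask.
shared = [dropout_mask(base_seed, mask_size) for _ in range(world_size)]

# The sketch of a fix: offset the seed by rank so each rank draws its own mask.
distinct = [
    dropout_mask(per_rank_seed(base_seed, rank), mask_size)
    for rank in range(world_size)
]

print("identical masks with shared seed:", all(m == shared[0] for m in shared))
print("unique masks with per-rank seed:", len({tuple(m) for m in distinct}))
```

In a real setup the derived seed would go to your framework's own seeding call instead of `random.Random`. It is probably safest to apply the per-rank seed only after the model weights are initialized or synchronized, so the replicas still start from the same parameters and only the dropout draws differ.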