Multi-GPU training does not optimize as expected

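Condition 1: 16*8 per GPU
Condition 2: 16*1 per GPU
The two conditions differ by a factor of 8, and it seems the learning rate must also be scaled by 8x for training to optimize comparably.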
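
For reference, this observation is consistent with the commonly used linear scaling heuristic, where the learning rate is multiplied by the same factor as the batch size. Below is a minimal sketch of that arithmetic. It assumes that 16*8 and 16*1 are the effective batch sizes (128 and 16); the function name `linear_scaled_lr` and the base learning rate of 0.01 are hypothetical and chosen only for illustration.

```python
def linear_scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Scale a learning rate in proportion to the batch size.

    This is the linear scaling heuristic, not a guarantee: it often
    needs a warmup phase and can break down for very large batches.
    """
    if base_batch <= 0 or batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * (batch / base_batch)


# Assumed reading of the two conditions in the post.
small_batch = 16 * 1   # condition 2
large_batch = 16 * 8   # condition 1

base_lr = 0.01  # hypothetical learning rate tuned for the small batch

scaled_lr = linear_scaled_lr(base_lr, small_batch, large_batch)
print(f"scale factor: {large_batch // small_batch}x")
print(f"scaled learning rate: {scaled_lr}")
```

If the framework averages gradients across GPUs (rather than summing them), the per-step update does not grow with the number of GPUs, which is one possible reason the learning rate has to be raised by hand.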