Dual GPU setup yields no speedup

I suspect I have a configuration problem that prevents any gain from a dual-GPU setup.
I have a 4080 and a 3090, which have a similar number of CUDA cores, so I expected some speedup when training BERT Large (a toy example).
I tested with a relatively small training set of 10,000 samples drawn from the Yelp review dataset.
What I found:
- Fastest: 4080 alone at 54.449 samples/sec (batch size manually set to 18, the maximum that fit in VRAM).
- Second fastest: 4080 and 3090 together at 51.828 samples/sec (batch size manually set to 18).
- Slowest: 3090 alone at 43.912 samples/sec (batch size manually set to 32).
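For context, comparing the measured dual-GPU rate against the ideal combined throughput of the two cards (numbers taken from the runs above) shows how far off it is; this is just arithmetic on the reported rates, not a separate benchmark:

```python
# Measured throughputs (samples/sec) from the runs above.
rate_4080 = 54.449
rate_3090 = 43.912
rate_dual = 51.828

# Ideal combined throughput if both cards scaled perfectly.
ideal = rate_4080 + rate_3090

# Scaling efficiency of the actual dual-GPU run.
efficiency = rate_dual / ideal
print(f"ideal: {ideal:.3f} samples/sec, efficiency: {efficiency:.1%}")
# → ideal: 98.361 samples/sec, efficiency: 52.7%
```

So the dual-GPU run is not only about half of the ideal combined rate, it is slower than the 4080 on its own.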

I use multiple DataLoader workers (set to 20 without any deeper consideration) and pinned memory.

Can this be caused by some sort of overhead combined with the relatively small training set? (Note: these tests didn't fix a random seed when sampling the training data; I'm not sure whether that could account for the consistently lower dual-GPU performance.)
Are there any specific settings in the training args I should consider?
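For reference, this is roughly how I launch training (the script name is a placeholder for my actual script). I've read that with a plain `python` launch and multiple visible GPUs the HF Trainer falls back to `DataParallel`, whereas launching via `torchrun` uses `DistributedDataParallel`, which I haven't tried yet:

```shell
# Current launch: with 2 visible GPUs the Trainer uses DataParallel,
# which adds per-step scatter/gather overhead on a single process.
python train_bert.py

# Untried alternative: torchrun starts one process per GPU, and the
# Trainer then uses DistributedDataParallel instead.
torchrun --nproc_per_node=2 train_bert.py
```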

Additional note: with dual-GPU training, the 3090 doesn't pull the same wattage as it does when running alone (270 W vs. 340 W). The power supply is a 1600 W EVGA.