How to set 'num_training_steps' for the learning rate scheduler?

Let’s say I’m training a model on a custom dataset.

The dataset has 2,000 samples, the batch size is 128, and the number of epochs is 10.
Also, I’d like to step the learning rate scheduler every iteration.

In this config, I have a question about two situations (single GPU and multi GPU).

When I train with a single GPU, the ‘num_training_steps’ for ‘get_cosine_schedule_with_warmup’ can be calculated as floor(data / batch_size) * num_epochs (assuming the DataLoader drops the last incomplete batch).
In this config, floor(2,000 / 128) * 10 = 15 * 10 = 150.
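
For concreteness, here is a minimal sketch of how I compute this (the model and optimizer are placeholders I made up just so the scheduler call runs; I assume drop_last=True in the DataLoader, which is why the division is floored):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Values from the question.
num_samples = 2000
batch_size = 128
num_epochs = 10

# floor(2000 / 128) = 15 batches per epoch, assuming drop_last=True.
steps_per_epoch = num_samples // batch_size
num_training_steps = steps_per_epoch * num_epochs  # 15 * 10 = 150

# Placeholder model/optimizer just to make the scheduler call concrete.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
```

(If the DataLoader keeps the last partial batch, i.e. drop_last=False, it would be ceil(2,000 / 128) * 10 = 16 * 10 = 160 instead.)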

But how can I calculate the ‘num_training_steps’ with multiple GPUs?
Intuitively, ‘num_training_steps’ = floor(data / batch_size) * num_epochs / num_gpus.
However, when I calculate this:
floor(2,000 / 128) * 10 / 4 = 15 * 10 / 4 = 150 / 4 = 37.5
What should I do when ‘num_training_steps’ is a float like this?
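
To make the float issue concrete, here is a sketch of one option I considered: reading the per-GPU step count from the DataLoader length instead of dividing by hand. The dataset is a dummy stand-in for my custom one, and rank=0 is hard-coded only so the snippet runs outside a real distributed launch:

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Dummy dataset standing in for the custom one (2,000 samples).
dataset = TensorDataset(torch.randn(2000, 10))
num_gpus = 4
num_epochs = 10

# DistributedSampler (with drop_last=False, the default) pads the dataset
# so every rank sees ceil(2000 / 4) = 500 samples per epoch.
sampler = DistributedSampler(dataset, num_replicas=num_gpus, rank=0)
loader = DataLoader(dataset, batch_size=128, sampler=sampler)

# len(loader) already reflects the per-rank rounding:
# ceil(500 / 128) = 4 batches per epoch on this rank.
num_training_steps = len(loader) * num_epochs
print(num_training_steps)  # 40 -> an integer, not 37.5
```

This sidesteps the float, since each rank’s loader length is already an integer, but I’m not sure it is the intended value for the scheduler.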

Also, what happens when the number of batches assigned to each GPU is different? For example (see the sketch below):

Total number of batches = 15
Number of GPUs = 4

Do three GPUs get allotted 4 batches each, with the remaining GPU getting 3?
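
(For reference, here is the naive round-robin allotment I have in mind; this is just my assumption, not necessarily what the sampler actually does:)

```python
# Hypothetical round-robin split of 15 global batches over 4 GPUs.
total_batches = 15
num_gpus = 4
batches_per_gpu = [
    total_batches // num_gpus + (1 if rank < total_batches % num_gpus else 0)
    for rank in range(num_gpus)
]
print(batches_per_gpu)  # [4, 4, 4, 3] -> three GPUs get 4 batches, one gets 3
```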

Thanks for helping.