How does num_images_per_prompt work internally?

Can someone explain how num_images_per_prompt works internally in StableDiffusionPipeline? On my system at least, it behaves very differently from running the pipe multiple times in sequence.

With the MPS backend, it actually throws an exception when n=2.
With the CPU backend, it makes my computer very slow and generates images much more slowly than running them in sequence. For example, n=1 => 4 s/it, n=4 => 32 s/it.

Does it try to run all the generations in parallel? I guess it was made for CUDA, but does it bring any speedup on CUDA compared to a for loop?
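
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the batched call and the for loop could be timed against each other. The helper names (`time_call`, `compare`, `per_image_cost`) are made up for illustration, and `generate` is a placeholder for whatever produces k images in one call; nothing here is taken from the diffusers source.

```python
import time
from typing import Callable, Dict, List


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def compare(generate: Callable[[int], List[object]], n: int) -> Dict[str, float]:
    """Time one batched call for n images against n sequential single-image calls.

    `generate(k)` is a hypothetical callable that returns k images in one call.
    """
    batched = time_call(lambda: generate(n))
    sequential = time_call(lambda: [generate(1) for _ in range(n)])
    # Guard against a zero reading from very fast stand-in functions.
    speedup = sequential / batched if batched > 0 else float("inf")
    return {"batched_s": batched, "sequential_s": sequential, "speedup": speedup}


def per_image_cost(seconds_per_iteration: float, batch_size: int) -> float:
    """Normalise a reported s/it figure by the number of images in the batch."""
    return seconds_per_iteration / batch_size


if __name__ == "__main__":
    # Figures from the CPU run above: 4 s/it at n=1 and 32 s/it at n=4.
    print(per_image_cost(4.0, 1))
    print(per_image_cost(32.0, 4))
```

With a loaded pipeline, `generate` could be something like `lambda k: pipe(prompt, num_images_per_prompt=k).images`, assuming `pipe` and `prompt` are already defined. Normalising the CPU numbers above per image gives 4 s per step per image at n=1 versus 8 s at n=4, so on my machine the batched call costs about twice as much per image as the loop.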