Training of diffusion models

I have a question about training diffusion models. When the model takes a conditional input (for example, a text-conditioned image generation model), the cross-attention in the network does not seem to use window attention like the Swin Transformer does; it uses global attention directly. Given that the input image is 512×512, is this the reason why diffusion models are so difficult to train?
On the other hand, replacing the global attention with window attention does not seem like a good idea either, because the text controls the generation of the whole image rather than of individual windows.
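To make the sizes concrete, here is a rough Python sketch that counts the attention-score entries per head for one layer under global self-attention, Swin-style windowed self-attention, and global cross-attention. The 64×64 feature map, the 77-token prompt, and the 8×8 window are assumptions I picked for illustration, not values taken from a specific model, and the function names are hypothetical.

```python
# Rough count of attention-score entries per head for a single layer.
# All sizes below are illustrative assumptions, not taken from a specific model.

def self_attention_scores(num_image_tokens: int) -> int:
    """Global self-attention: every image token attends to every image token."""
    return num_image_tokens * num_image_tokens


def windowed_self_attention_scores(num_image_tokens: int, window_tokens: int) -> int:
    """Swin-style window attention: tokens attend only within their own window."""
    num_windows = num_image_tokens // window_tokens
    return num_windows * window_tokens * window_tokens


def cross_attention_scores(num_image_tokens: int, num_text_tokens: int) -> int:
    """Cross-attention: image tokens are queries, text tokens are keys/values."""
    return num_image_tokens * num_text_tokens


side = 64                        # assumed feature-map side length
num_image_tokens = side * side   # 64 * 64 image tokens
num_text_tokens = 77             # assumed prompt length in tokens
window_tokens = 8 * 8            # assumed 8x8 window

print("global self-attention:  ", self_attention_scores(num_image_tokens))
print("windowed self-attention:", windowed_self_attention_scores(num_image_tokens, window_tokens))
print("global cross-attention: ", cross_attention_scores(num_image_tokens, num_text_tokens))
```

If I have counted correctly, the cross-attention score matrix has one row per image token and one column per text token, so under these assumptions it grows linearly with the number of image tokens, while global self-attention grows quadratically.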
I am not sure whether my understanding is correct, so I would appreciate your answers.