Is it reasonable to add ControlNet to InstructPix2Pix?

I want to use InstructPix2Pix to arrange items on store shelves. I gathered 200 pairs of before and after images: the before images show empty shelves (no items), and the after images show full shelves (with items). I trained for 5000 steps with a mean squared error loss function, and the training was successful. At inference (evaluation) time, however, the arrangement of items on the shelves is incorrect in some scenarios. The model does not seem to understand the layout of the before image well: when the layout differs from the ones in the dataset, I think the model fails to place the items in the right positions on the shelves.
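For clarity, the mean squared error loss mentioned above can be sketched as below. This is a minimal, framework-free illustration. It assumes the usual diffusion fine-tuning objective, in which the error is measured between the sampled noise and the noise predicted by the model; the post does not confirm this. The names `mse_loss`, `predicted_noise`, and `target_noise` are hypothetical and not taken from the actual training code.

```python
from typing import Sequence


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized sequences of values."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        raise ValueError("inputs must not be empty")
    # Average of the element-wise squared differences.
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy values standing in for flattened noise tensors
# (assumption: the loss compares sampled noise with predicted noise).
target_noise = [0.5, -1.0, 0.25, 0.0]
predicted_noise = [0.4, -0.8, 0.25, 0.1]

loss = mse_loss(predicted_noise, target_noise)
print(f"MSE loss: {loss:.4f}")
```

In a real training run the same quantity would be computed over tensors by the training framework rather than over Python lists.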

Q1- What is the problem? Is InstructPix2Pix the best solution for this task?
Q2- In your opinion, would adding ControlNet to InstructPix2Pix help?
Q3- Is it possible to use ControlNet img2img for this problem? Could that also be a suitable solution?