Input Data Format for Inpainting with Changed in_channels in diffusers

I’m planning to train the stable-diffusion-2-inpainting model on my own dataset of image.jpg and mask.jpg pairs. I understand that the UNet for this model expects a 9-channel input (in_channels=9) rather than the usual 4. Could you provide guidance on how to preprocess these image and mask files and structure them correctly for training? Specifically:

  • What is the expected tensor structure for the 9-channel input?
  • How should I combine my image and mask files to conform to this structure?
  • Are there any specific preprocessing steps or code examples available?
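
For reference, here is my current understanding as a minimal, shape-only sketch; please correct me if any of it is wrong. I am assuming the 9 channels are 4 noisy latent channels, 1 mask channel downsampled to latent resolution, and 4 latent channels of the masked image, concatenated in that order, which is how I read the diffusers inpainting pipeline. I am also assuming white in the mask marks the region to repaint and that the VAE downsamples by a factor of 8. The sketch uses NumPy only, and placeholder_encode is a hypothetical stand-in for the real VAE encoder, not a diffusers function.

```python
import numpy as np

LATENT_SCALE = 8  # assumed VAE downsampling factor (512x512 pixels -> 64x64 latents)


def preprocess_pair(image_u8, mask_u8):
    """image_u8: (H, W, 3) uint8 RGB; mask_u8: (H, W) uint8, white = repaint (assumed)."""
    image = image_u8.astype(np.float32) / 127.5 - 1.0   # scale pixels to [-1, 1]
    image = image.transpose(2, 0, 1)                     # (3, H, W)
    mask = mask_u8.astype(np.float32) / 255.0 >= 0.5     # binarize; JPEG masks are not exactly 0/255
    mask = mask.astype(np.float32)[None]                 # (1, H, W), values in {0, 1}
    masked_image = image * (1.0 - mask)                  # zero out the region to repaint
    return image, mask, masked_image


def placeholder_encode(x):
    """Hypothetical stand-in for the VAE encoder: (3, H, W) -> (4, H/8, W/8)."""
    c, h, w = x.shape
    s = LATENT_SCALE
    pooled = x.reshape(c, h // s, s, w // s, s).mean(axis=(2, 4))  # (3, H/8, W/8)
    return np.concatenate([pooled, pooled[:1]], axis=0)            # pad to 4 channels


def build_unet_input(noisy_latents, mask, masked_image):
    """Concatenate [noisy latents (4), mask (1), masked-image latents (4)] on the channel axis."""
    mask_small = mask[:, ::LATENT_SCALE, ::LATENT_SCALE]  # nearest-neighbour downsample
    masked_latents = placeholder_encode(masked_image)
    return np.concatenate([noisy_latents, mask_small, masked_latents], axis=0)


# Synthetic data in place of image.jpg / mask.jpg.
rng = np.random.default_rng(0)
image_u8 = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
mask_u8 = np.zeros((512, 512), dtype=np.uint8)
mask_u8[128:384, 128:384] = 255  # square region to repaint

image, mask, masked_image = preprocess_pair(image_u8, mask_u8)
noisy_latents = rng.standard_normal((4, 64, 64)).astype(np.float32)  # stand-in for noised image latents
unet_input = build_unet_input(noisy_latents, mask, masked_image)
print(unet_input.shape)
```

With a batch dimension this would become (B, 9, 64, 64) for 512x512 inputs. In actual training I would expect these to be torch tensors, with the placeholder replaced by the model's VAE encoder and the noisy latents produced by the noise scheduler from the encoded unmasked image, but I have not confirmed those details.
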

Any guidance or example code would be very helpful.