Trade-offs when upscaling an image

Hey guys, I am looking for a method to upscale an image and have implemented some code.
I want to share the results and findings.
Hopefully we can improve the design a little and come up with some new ideas.

Test base image (512x512):

Test steps:

  1. Upscale the image to 2048x2048 using R-ESRGANx4.
  2. Split the image into 512x512 tiles. Each tile overlaps its neighbours by N pixels, where N is a parameter.
  3. Run the diffusers img2img pipeline on each tile with the same prompt that generated the base image.
  4. Assemble the repainted tiles, doing a weighted addition where tiles overlap:
     if a region of the tile overlaps another tile:
        do a weighted addition and put the sum on the canvas
     else:
        put the region on the canvas
  5. Output the final image.
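
Steps 2 and 4 can be sketched in NumPy as below. This is only a minimal sketch, not the exact code behind the results: the helper names `tile_starts` and `blend_tiles` are made up for illustration, the last tile in each row and column is assumed to be shifted flush with the image border, and the "weighted addition" is assumed to be a plain equal-weight average.

```python
import numpy as np

def tile_starts(size, tile=512, overlap=48):
    """Start offsets along one axis; neighbouring tiles share at least `overlap` pixels."""
    if size <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # assumption: last tile sits flush with the border
    return starts

def blend_tiles(tiles, corners, height, width, channels=3):
    """Accumulate tiles on a canvas and divide by how many tiles covered each pixel."""
    canvas = np.zeros((height, width, channels), dtype=np.float64)
    weight = np.zeros((height, width, 1), dtype=np.float64)
    for tile, (y, x) in zip(tiles, corners):
        h, w = tile.shape[:2]
        canvas[y:y + h, x:x + w] += tile
        weight[y:y + h, x:x + w] += 1.0  # assumption: equal weights in the overlap
    # Where only one tile covers a pixel the weight is 1, so the tile is copied as-is.
    return canvas / weight

# Tile grid for the 2048x2048 case with N = 48.
starts = tile_starts(2048, tile=512, overlap=48)
corners = [(y, x) for y in starts for x in starts]
print(starts, len(corners))
```

With these assumptions a 2048x2048 image and N = 48 gives 5 start offsets per axis, so 25 tiles go through img2img. The last row and column overlap their neighbours by more than N because they are pushed back to the border.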

Time performance:
Euler-A is always faster than DDIM: it reaches the same quality with fewer iterations. I think it can go as low as 25 iterations, but I use 35 for the best result.

Memory performance:
Both schedulers consume almost the same GPU memory (about 6 GB total in my case). Tiled diffusion is almost a must-have; without it my GPU always runs out of memory. So far I have not found a good approach that directly upscales an image from 512x512 to a larger size. If you know of any good methods, please share them.

Findings and issues:

  1. A good upscaler is a good start. This is actually very important, because there is a trade-off between noise strength and consistency between tiles. If a poor upscaler is used with a low noise strength, the final image will be blurry, since img2img is “mimicking” the blurriness of the image it is given. However, if a high noise strength is used, the image no longer looks like the original.
  2. Overlap size and ghosting. If you look at the test results, specifically near the trees or other highly detailed regions, you will find ghost “regions”. These are the overlapping regions, where I am doing a weighted alpha sum of two tiles that were repainted independently and do not agree in detail. The larger the overlap, the smoother the image looks subjectively. However, a larger overlap also increases the size of the ghost regions. This is the issue I am trying to resolve, and I did not find any related discussions.
  3. Are there any other methods that we can implement to compare the results? I know that web-ui has a “high res fix” implementation. However, after checking the code, it seems to be a tiled-diffusion-based method. The issue “does diffusers have the equivalent to hires fix from A1111?” (Issue #3429 · huggingface/diffusers · GitHub) mentions that there is a latent space approach. However, I did not find any further discussion of it.
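
For the overlap trade-off in finding 2, one common way to do the weighted alpha sum is a linear ramp (feather) mask. The sketch below is an assumption about how such weights could look, not the weights used for the results in this post; `ramp_profile`, `feather_mask` and `blend_feathered` are hypothetical helper names, and the ramp width is assumed to equal the overlap N.

```python
import numpy as np

def ramp_profile(length, ramp):
    """1-D weights: 1 in the interior, falling linearly toward the two ends."""
    idx = np.arange(length)
    edge = np.minimum(idx + 1, length - idx)  # 1-based distance to the nearest end
    # Dividing by ramp + 1 keeps the outermost weight above zero, so pixels on
    # the image border (covered by a single tile) never get a zero total weight.
    return np.clip(edge / (ramp + 1), 0.0, 1.0)

def feather_mask(height, width, ramp=48):
    """2-D weight mask as the outer product of two 1-D ramps."""
    return np.outer(ramp_profile(height, ramp), ramp_profile(width, ramp))

def blend_feathered(tiles, corners, height, width, ramp=48, channels=3):
    """Weighted sum of tiles, normalised by the accumulated mask weights."""
    canvas = np.zeros((height, width, channels), dtype=np.float64)
    weight = np.zeros((height, width, 1), dtype=np.float64)
    for tile, (y, x) in zip(tiles, corners):
        h, w = tile.shape[:2]
        mask = feather_mask(h, w, ramp)[..., None]
        canvas[y:y + h, x:x + w] += tile * mask
        weight[y:y + h, x:x + w] += mask
    return canvas / weight
```

With a ramp like this each tile dominates near its own interior and fades toward its edges, so there is no hard seam. It does not remove ghosting: wherever the two weights are comparable, two independently repainted versions of the same detail are still mixed, and a wider ramp widens that band, which matches the trade-off described above.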

Test results (2048x2048):

  1. Tiled diffusion with DDIM, 45 iterations, overlapping 48 pixels.
  2. Tiled diffusion with Euler-A, 35 iterations, overlapping 48 pixels.
  3. Tiled diffusion with Euler-A, 35 iterations, overlapping 64 pixels.