For the purpose of visualizing work in progress on the image inference pipelines, I’d like to convert the latents to a display format at each step along the way. But I find that if I run them through the pipeline’s VAE decoder, that takes far too much time that could be better spent on the pipeline’s main task.
Because this is only for preview purposes, I don’t need to use an expensive method to scale them back up to 512×512. It would be sufficient to leave them at 64×64, letting the application do some naive upscaling if desired.
Is there a way to decode the 4-channel latent space to a 3-channel image format without upscaling? Will that be dramatically faster than the full decode+upscale method?
I tried fumbling around a bit creating an instance of AutoencoderKL configured similarly to that of the pretrained model, but without so many
UpDecoderBlocks. Unfortunately that attempt didn’t lead to any useful results, and I’m not sure if that was even on the right track.
After some empirical tests, I have determined that I can get a useful approximation of the RGB output using a linear combination of the latent channels.
This approximation comes from multiplying the four latent channels by these factors:
v1_4_rgb_latent_factors = [
# R G B
[ 0.298, 0.207, 0.208], # L1
[ 0.187, 0.286, 0.173], # L2
[-0.158, 0.189, 0.264], # L3
[-0.184, -0.271, -0.473], # L4
[This is for Stable Diffusion v1.4. I assume it’s not universally true.]
Here’s the output from the actual VAE decoder for comparison:
The approximation is a little undersaturated, maybe it could stand to have a bit of tuning, but it’s not bad considering it’s over two thousand times faster.
So that’s useful to know. But I certainly feel like I did this the hard way, by analyzing some outputs and developing a new approximation of them. Hopefully there’s an easier way to determine those values or some other similarly cheap approximation?
Wow, that’s a very neat way to approximate the decoded image, never thought that the latents would end up so interpretable!
Will keep that in mind for future pipelines
That’s pretty amazing, do those numbers work for all types of images? I wonder why your initial experiment to use a smaller decoder wouldn’t work, it sounds like a reasonable idea to me!
I fit them against some grayscale images, some bright saturated colors like the hot air balloon example above, and some mid-tones. The result seems close enough for the images I’ve seen come out of it. Sometimes the color accuracy is worse than others, but it’s sufficient if you’re just trying to get a rough idea of the composition.
I’ve posted some demo code for both gradio and ipywidgets: GitHub - keturn/sd-progress-demo
I’m sure it would have gone better if I had any idea what I was doing. But this is all my first project involving pytorch and neural networks. I have some more catching up to do on the fundamentals before I can even read how the decoder is put together, let alone understand how to modify it.
This post was flagged by the community and is temporarily hidden.