For the purpose of visualizing work in progress on the image inference pipelines, I'd like to convert the latents to a display format at each step along the way. But I find that running them through the pipeline's VAE decoder takes far too much time that could be better spent on the pipeline's main task.

Because this is only for preview purposes, I don't need to use an expensive method to scale them back up to 512×512. It would be sufficient to leave them at 64×64, letting the application do some naive upscaling if desired.

Is there a way to decode the 4-channel latent space to a 3-channel image format without upscaling? Will that be dramatically faster than the full decode+upscale method?

I tried fumbling around a bit, creating an instance of AutoencoderKL configured similarly to that of the pretrained model but without so many UpDecoderBlocks. Unfortunately that attempt didn't lead to any useful results, and I'm not sure it was even on the right track.

After some empirical tests, I have determined that I can get a useful approximation of the RGB output using a linear combination of the latent channels.

This approximation comes from multiplying the four latent channels by these factors:

v1_4_rgb_latent_factors = [
    #    R       G       B
    [ 0.298,  0.207,  0.208],  # L1
    [ 0.187,  0.286,  0.173],  # L2
    [-0.158,  0.189,  0.264],  # L3
    [-0.184, -0.271, -0.473],  # L4
]

[This is for Stable Diffusion v1.4. I assume it's not universally true.]
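For anyone wanting to try this, here's a minimal sketch of applying those factors to a latent array. I'm using NumPy for illustration (the same einsum works on PyTorch tensors), and the rescaling to uint8 is my own assumption about the latents' approximate range, so treat it as a starting point rather than the exact mapping used above:

```python
import numpy as np

# The 4x3 factor matrix from the post (Stable Diffusion v1.4)
v1_4_rgb_latent_factors = np.array([
    #    R       G       B
    [ 0.298,  0.207,  0.208],  # L1
    [ 0.187,  0.286,  0.173],  # L2
    [-0.158,  0.189,  0.264],  # L3
    [-0.184, -0.271, -0.473],  # L4
])

def latents_to_rgb(latents):
    """Map a [4, H, W] latent array to a [H, W, 3] uint8 preview image."""
    # Per-pixel linear combination over the latent channels:
    #   rgb[h, w, c] = sum_l latents[l, h, w] * factors[l, c]
    rgb = np.einsum("lhw,lc->hwc", latents, v1_4_rgb_latent_factors)
    # Assume the combined values land roughly in [-1, 1]; rescale to [0, 255]
    return ((rgb + 1.0) * 127.5).clip(0, 255).astype(np.uint8)
```

The result stays at the latent resolution (e.g. 64×64), so the application can do its own cheap upscaling for display.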

Here's the output from the actual VAE decoder for comparison:

The approximation is a little undersaturated and could stand a bit of tuning, but it's not bad considering it's over two thousand times faster.

So that's useful to know. But I certainly feel like I did this the hard way, by analyzing some outputs and fitting a new approximation to them. Hopefully there's an easier way to determine those values, or some other similarly cheap approximation?

Wow, that's a very neat way to approximate the decoded image; I never thought the latents would end up so interpretable!
I'll keep that in mind for future pipelines.

That's pretty amazing! Do those numbers work for all types of images? I wonder why your initial experiment with a smaller decoder didn't work; it sounds like a reasonable idea to me.

I fit them against some grayscale images, some bright saturated colors like the hot air balloon example above, and some mid-tones. The result seems close enough for the images I've seen come out of it. The color accuracy is sometimes worse than at other times, but it's sufficient if you're just trying to get a rough idea of the composition.

I'm sure it would have gone better if I had any idea what I was doing, but this is my first project involving PyTorch and neural networks. I have some more catching up to do on the fundamentals before I can even read how the decoder is put together, let alone understand how to modify it.

Got this working on my img2img doc branch of the InvokeUI fork. You can pass --write_intermediates to the dream> prompt and it will write every latent step to a PNG file.

That's very cool! Are you doing it without upscaling? Did you train a decoder, or did you do it differently? Can you please point me to the place in the code where that happens? Thanks!

I notice you are using a linear approximation with no constant term, x -> Ax (where A is the matrix).

Other things people could try are:

(1) A linear approximation with a constant term, x -> Ax + B.
(2) A quadratic approximation, x -> Ax^2 + Bx + C (this needs a total of 63 values to find; A would be a 3x4x4 tensor).
(3) A cubic approximation, x -> Ax^3 + Bx^2 + Cx + D (255 values).

The quadratic approximation should only be about 4 times slower, which should be fine since the process runs in milliseconds.
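To make the shapes in option (2) concrete, here's one way the per-pixel quadratic map could be evaluated (a sketch; the function name and the einsum formulation are mine). The parameter count works out to 3·4·4 + 3·4 + 3 = 63, matching the total above:

```python
import numpy as np

def quadratic_preview(latents, A, B, C):
    """
    Evaluate the per-pixel quadratic map rgb = x^T A x + B x + C.

    latents: [4, H, W] latent array (x is the 4-vector at each pixel)
    A: [3, 4, 4] quadratic coefficients (48 values)
    B: [3, 4] linear coefficients (12 values)
    C: [3] constant term (3 values)
    Returns a [H, W, 3] float array.
    """
    # Quadratic term: sum over channel pairs (i, j) of A[c,i,j] * x[i] * x[j]
    quad = np.einsum("cij,ihw,jhw->hwc", A, latents, latents)
    # Linear term: sum over channels i of B[c,i] * x[i]
    lin = np.einsum("ci,ihw->hwc", B, latents)
    return quad + lin + C
```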

The values should be obtainable with a simple neural network, using random images as training data. If anyone wants to give it a go, let us know the results!
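For the linear + bias case, a neural network isn't strictly necessary: an ordinary least-squares fit recovers x -> Ax + B in closed form. A sketch, assuming you've already collected matching flattened latent/RGB samples (the function name is hypothetical):

```python
import numpy as np

def fit_affine_decoder(latents, rgb):
    """
    Fit the affine map x -> Ax + b by least squares.

    latents: [N, 4] flattened latent samples (one 4-vector per pixel)
    rgb: [N, 3] matching RGB targets from the full VAE decode
    Returns A ([4, 3]) and b ([3]).
    """
    # Append a constant-1 column so the bias is fit along with the weights
    X = np.concatenate([latents, np.ones((latents.shape[0], 1))], axis=1)
    coeffs, *_ = np.linalg.lstsq(X, rgb, rcond=None)  # [5, 3]
    return coeffs[:4], coeffs[4]
```

The same matrix shape as the factors posted above falls out of A, with b as the extra constant term from option (1).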

I would be very interested in a way to directly decode the latent with less noise and more accurate colors; if anyone has an idea how to do that, I would love to try it out.

Chiming in with some other options for latent → RGB previewing:

BirchLabs posted a few more linear-layer-only decoders here (including a linear + bias version and a 3-layer MLP). These all produce RGB images that are the same size as the latents.

I posted a tiny conv-only latent decoder here, which produces full 512×512 preview images (it's slower than the linear-layer-only options, but should still be way faster than the official decoder).