For the purpose of visualizing work in progress on the image inference pipelines, I’d like to convert the latents to a display format at each step along the way. But I find that if I run them through the pipeline’s VAE decoder, that takes far too much time that could be better spent on the pipeline’s main task.
Because this is only for preview purposes, I don’t need to use an expensive method to scale them back up to 512×512. It would be sufficient to leave them at 64×64, letting the application do some naive upscaling if desired.
Is there a way to decode the 4-channel latent space to a 3-channel image format without upscaling? Will that be dramatically faster than the full decode+upscale method?
I tried fumbling around a bit creating an instance of AutoencoderKL configured similarly to that of the pretrained model, but without so many UpDecoderBlocks. Unfortunately that attempt didn’t lead to any useful results, and I’m not sure if that was even on the right track.
After some empirical tests, I have determined that I can get a useful approximation of the RGB output using a linear combination of the latent channels.
This approximation comes from multiplying the four latent channels by these factors:
v1_4_rgb_latent_factors = [
    #   R       G       B
    [ 0.298,  0.207,  0.208],  # L1
    [ 0.187,  0.286,  0.173],  # L2
    [-0.158,  0.189,  0.264],  # L3
    [-0.184, -0.271, -0.473],  # L4
]
[This is for Stable Diffusion v1.4. I assume it’s not universally true.]
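For anyone who wants to try this, here is a minimal sketch of how the factors might be applied. It uses NumPy rather than the pipeline's own tensors, and the function name `latents_to_preview`, the `scale` parameter, and the assumption that the weighted sum lands roughly in [-1, 1] (like the VAE decoder output) are illustrative choices, not something taken from the original code.

```python
import numpy as np

# Factors quoted above for Stable Diffusion v1.4.
# Rows are latent channels L1..L4, columns are R, G, B.
v1_4_rgb_latent_factors = np.array([
    [ 0.298,  0.207,  0.208],
    [ 0.187,  0.286,  0.173],
    [-0.158,  0.189,  0.264],
    [-0.184, -0.271, -0.473],
])

def latents_to_preview(latents, factors=v1_4_rgb_latent_factors, scale=1):
    """Map a (4, H, W) latent array to an (H*scale, W*scale, 3) uint8 image."""
    # Weighted sum over the channel axis: (4, H, W) with (4, 3) -> (H, W, 3).
    rgb = np.einsum("chw,cr->hwr", latents, factors)
    # Assumption: the result lies roughly in [-1, 1]; rescale to [0, 1] and clip.
    rgb = np.clip((rgb + 1.0) / 2.0, 0.0, 1.0)
    image = (rgb * 255).round().astype(np.uint8)
    # Naive nearest-neighbour upscaling: repeat each pixel along both axes.
    if scale > 1:
        image = image.repeat(scale, axis=0).repeat(scale, axis=1)
    return image

# Random stand-in for one latent from the pipeline (not real model output).
latents = np.random.default_rng(0).standard_normal((4, 64, 64))
preview = latents_to_preview(latents, scale=8)
print(preview.shape, preview.dtype)
```

With `scale=1` the preview stays at the latent resolution (64×64 for a 512×512 image), so any upscaling can be left to the application.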
Here’s the output from the actual VAE decoder for comparison:
The approximation is a little undersaturated and could probably stand a bit of tuning, but it’s not bad considering it’s over two thousand times faster.
So that’s useful to know. But I certainly feel like I did this the hard way, by analyzing some outputs and developing a new approximation of them. Hopefully there’s an easier way to determine those values or some other similarly cheap approximation?
Wow, that’s a very neat way to approximate the decoded image, never thought that the latents would end up so interpretable!
Will keep that in mind for future pipelines
That’s pretty amazing, do those numbers work for all types of images? I wonder why your initial experiment to use a smaller decoder wouldn’t work, it sounds like a reasonable idea to me!
I fit them against some grayscale images, some bright saturated colors like the hot air balloon example above, and some mid-tones. The result seems close enough for the images I’ve seen come out of it. Sometimes the color accuracy is worse than others, but it’s sufficient if you’re just trying to get a rough idea of the composition.
I’ve posted some demo code for both gradio and ipywidgets: GitHub - keturn/sd-progress-demo
I’m sure it would have gone better if I had any idea what I was doing. But this is all my first project involving pytorch and neural networks. I have some more catching up to do on the fundamentals before I can even read how the decoder is put together, let alone understand how to modify it.
Got this working on my img2img doc branch of the InvokeAI fork. You can pass --write_intermediates to the dream> prompt and it will write every latent step to a PNG file.
That’s very cool! Are you doing it without upscaling? Did you train a decoder, or did you do it differently? Can you please point me to the place in the code where that happens? Thanks!
No upscaling; it’s a tiny image (64×64 pixels at an image size of 512×512). InvokeAI/dream.py at fe401e88a0a143f01d175b6e56e3a1bf5d60ee2d · damian0815/InvokeAI · GitHub. I’m just using keturn’s/erucipe’s method.
Ah, I misunderstood, thought it was a different method
Hi, this is very good! I salute you! o7
I notice you are using a linear approximation with no constant term, x->Ax (where A is the matrix).
Other things people could try are:
(1) Linear approximation with constant term x->Ax+B
(2) A quadratic approximation x->Ax^2+Bx+C (this would need a total of 63 values to find!). A would be a 3x4x4 tensor.
(3) A cubic approximation x->Ax^3+Bx^2+Cx+D. (255 values)
The quadratic approximation should only be about 4 times slower but that should be fine since the process can be done in milliseconds.
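To make the counts above concrete, here is a rough sketch of how the quadratic form could be evaluated for a single latent pixel. It reads "Ax^2" as a contraction of a 3x4x4 tensor with x twice, which is one interpretation; the function name `quadratic_map` is made up for illustration.

```python
import numpy as np

n_in, n_out = 4, 3  # latent channels in, RGB channels out

# Coefficient counts per term, with no symmetry reduction.
constant = n_out                # C: 3 values
linear = n_out * n_in           # B: 3 x 4 = 12 values
quadratic = n_out * n_in ** 2   # A: 3 x 4 x 4 = 48 values
cubic = n_out * n_in ** 3       # 3 x 4 x 4 x 4 = 192 values

print(constant + linear + quadratic)          # 63
print(constant + linear + quadratic + cubic)  # 255

def quadratic_map(x, A, B, C):
    """y[r] = sum_ij A[r,i,j] * x[i] * x[j] + sum_i B[r,i] * x[i] + C[r]."""
    return np.einsum("rij,i,j->r", A, x, x) + B @ x + C
```

Since x_i x_j equals x_j x_i, A could be stored symmetrically with fewer independent values; the totals of 63 and 255 count every entry.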
The values should be findable using a simple neural network with random images as training data. If anyone wants to give it a go, let us know the results.
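For the linear case with a constant term, a neural network may not even be needed: ordinary least squares solves it in closed form. The sketch below swaps in `numpy.linalg.lstsq` for the suggested network and runs on synthetic stand-in data. In real use the inputs would be latent pixels paired with decoder output downsampled to the latent resolution, which is an assumption about how the training pairs would be built. The name `fit_affine` is hypothetical.

```python
import numpy as np

def fit_affine(latents, rgb):
    """Least-squares fit of rgb ~= latents @ weights + bias.

    latents: (N, 4) latent pixels; rgb: (N, 3) matching target colours.
    """
    # Append a column of ones so the constant term is fitted with the matrix.
    X = np.hstack([latents, np.ones((latents.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(X, rgb, rcond=None)
    return coeffs[:-1], coeffs[-1]  # weights: (4, 3), bias: (3,)

# Synthetic stand-in data (not real latents): a known affine map plus noise.
rng = np.random.default_rng(0)
true_weights = rng.normal(size=(4, 3))
true_bias = rng.normal(size=3)
latents = rng.normal(size=(1000, 4))
rgb = latents @ true_weights + true_bias + 0.01 * rng.normal(size=(1000, 3))

weights, bias = fit_affine(latents, rgb)
print(np.abs(weights - true_weights).max(), np.abs(bias - true_bias).max())
```

The same approach extends to the quadratic and cubic cases by adding columns of products such as x_i * x_j to `X`, since the model stays linear in its coefficients.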
I would be very interested in a way to directly decode the latent with less noise and more accurate colors, if anyone has an idea on how to do that I would love to try it out.