Decoding latents to RGB without upscaling

After some empirical tests, I have determined that I can get a useful approximation of the RGB output using a linear combination of the latent channels.

[Image: sd-latent-channels-linear-approximation]

This approximation comes from multiplying the four latent channels by these factors:

v1_4_rgb_latent_factors = [
    #   R       G       B
    [ 0.298,  0.207,  0.208],  # L1
    [ 0.187,  0.286,  0.173],  # L2
    [-0.158,  0.189,  0.264],  # L3
    [-0.184, -0.271, -0.473],  # L4
]

[This is for Stable Diffusion v1.4. I assume it’s not universally true.]
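In case it's useful, here's roughly how I apply those factors to turn a latent tensor into a preview image. The function name, the einsum contraction, and the final remap to [0, 1] for display are just how I'd sketch it, not anything canonical, and it reuses the factor list above:

import torch

def latents_to_rgb_approx(latents, factors=None):
    """Cheap RGB preview of SD v1.4 latents.

    latents: tensor of shape (4, H/8, W/8), one sample's latent channels.
    Returns a float tensor of shape (H/8, W/8, 3), clamped to [0, 1].
    """
    if factors is None:
        factors = torch.tensor(v1_4_rgb_latent_factors)
    # Linear combination over the channel axis:
    # rgb[y, x, c] = sum_k latents[k, y, x] * factors[k, c]
    rgb = torch.einsum("khw,kc->hwc", latents, factors.to(latents))
    # Latent values roughly span [-1, 1]; remap to [0, 1] for display.
    # (This remap is an assumption and may be part of why it looks undersaturated.)
    return (rgb * 0.5 + 0.5).clamp(0.0, 1.0)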

Here’s the output from the actual VAE decoder for comparison:
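(For anyone wanting to reproduce the comparison, this is roughly what I mean by the actual decode: the diffusers AutoencoderKL with the usual 0.18215 latent scaling factor for SD v1.x, applied to the same latents tensor as above. Treat it as a sketch rather than exactly what I ran.)

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

with torch.no_grad():
    # latents: (1, 4, H/8, W/8), as produced by the sampler
    decoded = vae.decode(latents / 0.18215).sample  # (1, 3, H, W), roughly in [-1, 1]
    image = (decoded / 2 + 0.5).clamp(0, 1)         # remap to [0, 1] for viewing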

The approximation is a little undersaturated and could probably stand a bit of tuning, but it's not bad considering it's over two thousand times faster than the full decode.

So that’s useful to know. But I certainly feel like I did this the hard way, by analyzing some outputs and developing a new approximation of them. Hopefully there’s an easier way to determine those values or some other similarly cheap approximation?