After some empirical tests, I have determined that I can get a useful approximation of the RGB output using a linear combination of the latent channels.
This approximation comes from multiplying the four latent channels by these factors:
v1_4_rgb_latent_factors = [
    #    R       G       B
    [ 0.298,  0.207,  0.208],  # L1
    [ 0.187,  0.286,  0.173],  # L2
    [-0.158,  0.189,  0.264],  # L3
    [-0.184, -0.271, -0.473],  # L4
]
[This is for Stable Diffusion v1.4. I assume it’s not universally true.]
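To make the arithmetic concrete, here is a minimal sketch in plain Python of what "multiplying the four latent channels by these factors" could look like per pixel. The function names `latent_pixel_to_rgb` and `latents_to_rgb` are made up for illustration, and the step that maps the result from roughly [-1, 1] to 0..255 is an assumption about the value range, not something measured here. In practice you would probably do this as a single matrix multiply in your tensor library rather than looping.

```python
# Factors for Stable Diffusion v1.4, copied from above:
# one row per latent channel (L1..L4), columns are R, G, B.
V1_4_RGB_LATENT_FACTORS = [
    [ 0.298,  0.207,  0.208],  # L1
    [ 0.187,  0.286,  0.173],  # L2
    [-0.158,  0.189,  0.264],  # L3
    [-0.184, -0.271, -0.473],  # L4
]


def latent_pixel_to_rgb(latent, factors=V1_4_RGB_LATENT_FACTORS):
    """Approximate one RGB pixel from the four latent values at one position.

    Hypothetical helper: each output channel is a weighted sum of the
    four latent channels.
    """
    rgb = []
    for c in range(3):
        value = sum(latent[k] * factors[k][c] for k in range(4))
        # Assumption: the weighted sum lands in roughly [-1, 1].
        # Shift and scale to [0, 1], clamp, then convert to 0..255.
        scaled = (value + 1.0) / 2.0
        clamped = min(1.0, max(0.0, scaled))
        rgb.append(int(clamped * 255))
    return tuple(rgb)


def latents_to_rgb(latents, factors=V1_4_RGB_LATENT_FACTORS):
    """Convert latents laid out as [4][H][W] nested lists into an
    [H][W] grid of (r, g, b) tuples."""
    height = len(latents[0])
    width = len(latents[0][0])
    return [
        [
            latent_pixel_to_rgb([latents[k][y][x] for k in range(4)], factors)
            for x in range(width)
        ]
        for y in range(height)
    ]
```

Since each latent position becomes exactly one pixel, the preview has the same resolution as the latent grid, not the full decoded image.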
Here’s the output from the actual VAE decoder for comparison:
The approximation is a little undersaturated and could probably stand a bit of tuning, but it’s not bad considering it’s over two thousand times faster.
So that’s useful to know. But I certainly feel like I did this the hard way, by analyzing some outputs and developing a new approximation of them. Is there an easier way to determine those values, or some other similarly cheap approximation?