For the purpose of visualizing work in progress on the image inference pipelines, I’d like to convert the latents to a display format at each step along the way. But I find that if I run them through the pipeline’s VAE decoder, that takes far too much time that could be better spent on the pipeline’s main task.
Because this is only for preview purposes, I don’t need to use an expensive method to scale them back up to 512×512. It would be sufficient to leave them at 64×64, letting the application do some naive upscaling if desired.
Is there a way to decode the 4-channel latent space to a 3-channel image format without upscaling? Will that be dramatically faster than the full decode+upscale method?
I tried fumbling around a bit creating an instance of AutoencoderKL configured similarly to that of the pretrained model, but without so many UpDecoderBlocks. Unfortunately that attempt didn’t lead to any useful results, and I’m not sure if that was even on the right track.
The approximation is a little undersaturated and could probably stand a bit of tuning, but it’s not bad considering it’s over two thousand times faster.
So that’s useful to know. But I certainly feel like I did this the hard way, by analyzing some outputs and developing a new approximation of them. Hopefully there’s an easier way to determine those values or some other similarly cheap approximation?
I fit them against some grayscale images, some bright saturated colors like the hot air balloon example above, and some mid-tones. The result seems close enough for the images I’ve seen come out of it. Sometimes the color accuracy is worse than others, but it’s sufficient if you’re just trying to get a rough idea of the composition.
I’m sure it would have gone better if I had any idea what I was doing. But this is all my first project involving pytorch and neural networks. I have some more catching up to do on the fundamentals before I can even read how the decoder is put together, let alone understand how to modify it.
I notice you are using a linear approximation with no constant term, x->Ax (where A is the matrix).
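For anyone following along, here is a minimal sketch of what that x->Ax preview step might look like, assuming the latents are a (4, H, W) array and the result lands roughly in [-1, 1]. The matrix values below are hypothetical placeholders, not the fitted values discussed in this thread, and `latents_to_rgb` is just an illustrative name. NumPy is used to keep it self-contained; the same einsum should carry over to torch tensors.

```python
import numpy as np

# Hypothetical placeholder coefficients: one row per latent channel, one
# column per RGB channel. These are NOT the fitted values from this thread.
A = np.array([
    [ 0.3,  0.2,  0.2],
    [ 0.2,  0.3,  0.2],
    [-0.2,  0.2,  0.3],
    [-0.2, -0.3, -0.5],
])

def latents_to_rgb(latents, factors):
    """Map a (4, H, W) latent array to an (H, W, 3) uint8 image, no upscaling."""
    # Per-pixel matrix product: each pixel's 4 latent values become 3 RGB values.
    rgb = np.einsum("chw,cr->hwr", latents, factors)
    # Assumption: the output is roughly in [-1, 1]; rescale to [0, 1] and clip.
    rgb = np.clip((rgb + 1.0) / 2.0, 0.0, 1.0)
    return (rgb * 255).round().astype(np.uint8)

# Random stand-in for a real 64x64 latent taken from a pipeline step.
latents = np.random.default_rng(0).standard_normal((4, 64, 64))
preview = latents_to_rgb(latents, A)
```

The output stays at 64×64, so any enlargement is left to the application.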
Other things people could try are:
(1) Linear approximation with constant term x->Ax+B
(2) A quadratic approximation x->Ax^2+Bx+C, where A would be a 3x4x4 tensor (this would need a total of 48+12+3 = 63 values to find!).
(3) A cubic approximation x->Ax^3+Bx^2+Cx+D. (255 values)
The quadratic approximation should only be about 4 times slower, but that should be fine since the process can be done in milliseconds.
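To make the quadratic form in (2) concrete, here is a rough sketch that reads Ax^2 as a per-pixel sum over A[r,i,j]·x_i·x_j and checks the parameter counts quoted above. The coefficient values are arbitrary random numbers used only to exercise the shapes, and `quadratic_preview` is a hypothetical name.

```python
import numpy as np

def quadratic_preview(x, A, B, C):
    """Per pixel: y_r = sum_ij A[r,i,j]*x_i*x_j + sum_i B[r,i]*x_i + C[r].

    x: (4, H, W) latents; A: (3, 4, 4); B: (3, 4); C: (3,). Returns (3, H, W).
    """
    quad = np.einsum("rij,ihw,jhw->rhw", A, x, x)
    lin = np.einsum("ri,ihw->rhw", B, x)
    return quad + lin + C[:, None, None]

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 4))  # arbitrary values, for shape checking only
B = rng.standard_normal((3, 4))
C = rng.standard_normal(3)

n_quadratic = A.size + B.size + C.size  # 48 + 12 + 3
n_cubic = 3 * 4**3 + n_quadratic        # adds a 3x4x4x4 tensor for the cubic term

x = rng.standard_normal((4, 64, 64))
y = quadratic_preview(x, A, B, C)
```

Note that A counted this way includes redundant entries (x_i·x_j and x_j·x_i share a product), so a fit would not pin down all 48 of them independently.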
The values should be findable using a simple neural network with random images as training data, if anyone wants to give it a go. Let us know the results.
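For the linear-with-constant case (1), a neural network may not even be necessary: an ordinary least-squares fit should find the same values directly. The sketch below swaps in `np.linalg.lstsq` for the suggested network and uses synthetic stand-in data, since real training pairs would be latent pixels matched with the corresponding pixels of the decoded image shrunk to latent resolution. `fit_affine` and the `true_A`/`true_b` values are hypothetical, for illustration only.

```python
import numpy as np

def fit_affine(latents, rgb):
    """Least-squares fit of rgb ~ latents @ A + b.

    latents: (N, 4) latent pixels; rgb: (N, 3) target colours.
    Returns A with shape (4, 3) and b with shape (3,).
    """
    # Append a column of ones so the constant term is fitted with the matrix.
    X = np.hstack([latents, np.ones((latents.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, rgb, rcond=None)
    return coef[:4], coef[4]

# Synthetic stand-in data: a made-up affine map plus a little noise.
rng = np.random.default_rng(0)
true_A = rng.standard_normal((4, 3))
true_b = rng.standard_normal(3)
latents = rng.standard_normal((1000, 4))
rgb = latents @ true_A + true_b + 0.01 * rng.standard_normal((1000, 3))

A, b = fit_affine(latents, rgb)
```

The same trick extends to the quadratic and cubic cases by adding product columns (x_i·x_j and so on) to X, since those models are still linear in the unknown coefficients.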