VAE: change the shape of the latent space

I’m trying to get a diffusion model to generate pixel art, but speed is important (I’m hoping to turn it into a real-time interactive game, so I’m aiming for ~30 fps), so I can’t just generate a 512x512 image and then downscale it.

I’ve followed the blog post blog/stable_diffusion, which helped a lot in getting started, but now I want to generate a smaller image, which is proving hard to figure out because everyone else seems to want to generate bigger images.

When I try to decrease the size of the output image, the quality drops, because the shape of the latent space is chosen automatically from the requested output image’s shape (so asking for a 64x64 image leaves me with only an 8x8 latent):

    shape = (
        batch_size,
        unet.config.in_channels,
        height // 8,  # latent height is 1/8 of the requested image height
        width // 8,   # latent width is 1/8 of the requested image width
    )
    # Seed the generator to create the initial latent noise
    generator = torch.manual_seed(0)
    latents = torch.randn(shape, generator=generator)

But if I change the shape so that my image is (for example) 64x64 and the latent is also 64x64, then when I decode the latents the output image comes out as 512x512, because (as far as I can tell) the VAE decoder always upsamples by a factor of 8 rather than targeting the image size I asked for.
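
A quick shape-only check (random latents, no scaling, since only the tensor shapes matter here) shows what I mean:

    import torch
    from diffusers import AutoencoderKL

    vae = AutoencoderKL.from_pretrained(
        "CompVis/stable-diffusion-v1-4", subfolder="vae"
    )

    # Random 64x64 "latents" -- only the decoded shape matters here
    latents = torch.randn(1, vae.config.latent_channels, 64, 64)
    with torch.no_grad():
        image = vae.decode(latents).sample

    print(image.shape)  # torch.Size([1, 3, 512, 512]), i.e. always 8x the latent size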

Is it possible to tell the VAE what size the latent space should be? I’m defining the VAE like so:

    from diffusers import AutoencoderKL

    vae = AutoencoderKL.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        subfolder="vae",
        variant="fp16",  # load the half-precision weights
    )
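
From poking around the pipeline source, the 8x factor seems to be baked into the VAE architecture (the number of up/downsampling blocks) rather than being something you pass in at decode time:

    # As far as I can tell, the pipeline derives its fixed scale factor
    # from the VAE config, not from the requested image size:
    vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)
    print(vae_scale_factor)  # 8 for stable-diffusion-v1-4

So is the only option to feed the decoder an 8x8 latent to get a 64x64 image, or can the upsampling factor itself be changed?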