What is the implementation of the text2vid VAE encoder in diffusers?

I’m building a text2vid model from scratch in PyTorch and using diffusers as a reference for the VAE architecture.