Building an SFX Generation Model with EnCodec

I’m trying to build my first major architecture for SFX sound synthesis. I’m planning to train the model on ~6 hours of film FX datasets (footsteps, dog barks, etc.) so that it can produce novel sounds for whatever input it’s fed.

My original pipeline involved:

Training: Input —> Spectrograms —> VQVAE —> PixelSNAIL

Inference: Trained PixelSNAIL —> VQVAE Decoder —> HiFiGAN —> Output
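For the front end of the original pipeline, a minimal sketch of the spectrogram stage might look like this (function and parameter names are my own, just to illustrate the framing/windowing/FFT mechanics):

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=1024, hop=256):
    """Frame the waveform, apply a Hann window, take the FFT magnitude."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=-1))

# 1 second of a 440 Hz sine at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (n_frames, n_fft // 2 + 1)
```

In practice you’d likely use a mel-scaled log spectrogram (e.g. via torchaudio) before the VQVAE, but the shape of the data is the same: a 2D time–frequency grid.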

However, I’m now thinking of replacing the spectrograms with Meta’s EnCodec tokenizer and feeding its output into the VQVAE, since EnCodec seems to compress audio heavily while retaining the most salient features better than spectrograms do. I’d appreciate your thoughts on this pipeline, and whether anyone here has worked on similar projects with EnCodec and found it successful for this type of task.
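For intuition about what EnCodec’s tokenizer produces: it’s a residual vector quantizer, where each stage quantizes the residual left over by the previous stage, so every latent frame becomes a small stack of codebook indices (parallel token streams). A toy numpy sketch of the mechanics follows; the codebooks here are random placeholders, whereas in EnCodec they are learned, so this only illustrates the encoding/decoding structure:

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(latents, codebooks):
    """Residual VQ: each stage quantizes what the previous stage missed."""
    residual = latents
    codes = []
    for cb in codebooks:                    # cb: (codebook_size, dim)
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)          # one token per frame per stage
        codes.append(idx)
        residual = residual - cb[idx]       # pass the residual downstream
    return np.stack(codes)                  # (n_stages, n_frames)

def rvq_decode(codes, codebooks):
    """Sum the codebook entries selected at each stage."""
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

dim, n_stages = 8, 4
codebooks = [rng.normal(size=(32, dim)) for _ in range(n_stages)]
latents = rng.normal(size=(100, dim))       # 100 latent frames
codes = rvq_encode(latents, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes.shape)  # (4, 100)
```

One consequence worth considering for the pipeline: since EnCodec’s output is already a discrete code grid, the VQVAE’s quantization step may be redundant, and an autoregressive model like PixelSNAIL could in principle be trained directly on the EnCodec tokens.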