Building an SFX Generation Model with EnCodec

I’m trying to build my first major architecture for SFX sound synthesis. I’m planning to train the model on ~6 hours of film FX datasets (footsteps, dog barks, etc.) so that it can produce novel sounds for whatever input it’s fed.

My original pipeline involved:

Training: Input —> Spectrograms —> VQVAE —> PixelSNAIL

Inference: Trained PixelSNAIL —> VQVAE Decoder —> HiFiGAN —> Output
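For the front end of the original pipeline, a minimal sketch of the spectrogram stage might look like this (function and parameter names are my own, just to illustrate the framing/windowing/FFT mechanics):

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=1024, hop=256):
    """Frame the waveform, apply a Hann window, take the FFT magnitude."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=-1))

# 1 second of a 440 Hz sine at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (n_frames, n_fft // 2 + 1)
```

In practice you’d likely use a mel-scaled log spectrogram (e.g. via torchaudio) before the VQVAE, but the shape of the data is the same: a 2D time–frequency grid.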

However, I’m now thinking of replacing the spectrograms with Meta’s EnCodec tokenizer and feeding its output into the VQVAE, since EnCodec seems to compress audio heavily while retaining the most salient features better than spectrograms do. I’d appreciate your thoughts on this pipeline, and whether anyone here has worked on similar projects with EnCodec and found it successful for this type of task.
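For intuition about what EnCodec’s tokenizer produces: it’s a residual vector quantizer, where each stage quantizes the residual left over by the previous stage, so every latent frame becomes a small stack of codebook indices (parallel token streams). A toy numpy sketch of the mechanics follows; the codebooks here are random placeholders, whereas in EnCodec they are learned, so this only illustrates the encoding/decoding structure:

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(latents, codebooks):
    """Residual VQ: each stage quantizes what the previous stage missed."""
    residual = latents
    codes = []
    for cb in codebooks:                    # cb: (codebook_size, dim)
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)          # one token per frame per stage
        codes.append(idx)
        residual = residual - cb[idx]       # pass the residual downstream
    return np.stack(codes)                  # (n_stages, n_frames)

def rvq_decode(codes, codebooks):
    """Sum the codebook entries selected at each stage."""
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

dim, n_stages = 8, 4
codebooks = [rng.normal(size=(32, dim)) for _ in range(n_stages)]
latents = rng.normal(size=(100, dim))       # 100 latent frames
codes = rvq_encode(latents, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes.shape)  # (4, 100)
```

One consequence worth considering for the pipeline: since EnCodec’s output is already a discrete code grid, the VQVAE’s quantization step may be redundant, and an autoregressive model like PixelSNAIL could in principle be trained directly on the EnCodec tokens.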