Advice for Singing Synthesis and Tokenization

Hi everyone (yes I reformatted this post using AI to give it a bit more structure :grinning_face_with_smiling_eyes:)!

I’m a CS student from Germany planning my master’s thesis on singing synthesis (starting late 2025). I’ve been teaching myself audio synthesis since early this year; I have practical experience with transformer architectures but only theoretical knowledge of diffusion/flow matching, and I need practical guidance on audio codec implementation.

My Planned Architecture

  1. VAE codec → continuous latents (for flow-matching compatibility)

  2. Autoregressive model → generates structural latents (timing, melody contour, song structure)

  3. Diffusion transformer → refines latents into high-quality audio

  4. Conditioning on lyrics, style prompts (optional), or reference audio clips (optional)
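To make the plan a bit more concrete, here is a minimal sketch of how I imagine the three stages fitting together (all module names, dimensions, and hyperparameters are placeholders I made up, not taken from any specific paper; each stage would be trained separately, and things like the diffusion timestep embedding and lyric conditioning are omitted):

```python
import torch
import torch.nn as nn

# All names and dimensions below are placeholder assumptions.
LATENT_DIM = 64     # channels of the continuous VAE latent sequence
STRUCT_DIM = 32     # channels of the coarse "structural" latents

class VAECodec(nn.Module):
    """Waveform <-> continuous latents (stand-in for a real audio VAE)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv1d(1, 2 * LATENT_DIM, kernel_size=1024, stride=512)
        self.decoder = nn.ConvTranspose1d(LATENT_DIM, 1, kernel_size=1024, stride=512)

    def encode(self, wav):                        # wav: (B, 1, T_samples)
        mean, logvar = self.encoder(wav).chunk(2, dim=1)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
        return z                                  # (B, LATENT_DIM, T_latent)

    def decode(self, z):
        return self.decoder(z)

class StructureAR(nn.Module):
    """Autoregressive transformer over coarse structure (timing, melody, sections).
    The causal mask needed for real AR training is omitted in this sketch."""
    def __init__(self):
        super().__init__()
        self.proj_in = nn.Linear(STRUCT_DIM, 256)
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.proj_out = nn.Linear(256, STRUCT_DIM)

    def forward(self, x):                         # x: (B, T, STRUCT_DIM)
        return self.proj_out(self.transformer(self.proj_in(x)))

class LatentDiffusion(nn.Module):
    """Diffusion / flow-matching transformer that refines noisy VAE latents,
    conditioned on the structural latents (timestep embedding and lyric/style
    conditioning left out of this sketch)."""
    def __init__(self):
        super().__init__()
        self.cond_proj = nn.Linear(STRUCT_DIM, LATENT_DIM)
        layer = nn.TransformerEncoderLayer(d_model=LATENT_DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, noisy_latents, structure):  # both (B, T, channels)
        return self.backbone(noisy_latents + self.cond_proj(structure))

# Quick shape check with dummy data
codec, ar, diff = VAECodec(), StructureAR(), LatentDiffusion()
wav = torch.randn(2, 1, 512 * 100)                # a few seconds of dummy audio
z = codec.encode(wav)                             # (2, 64, T_latent)
structure = ar(torch.randn(2, z.shape[-1], STRUCT_DIM))
refined = diff(z.transpose(1, 2), structure)      # (2, T_latent, 64)
print(z.shape, refined.shape)
```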

I also want to be able to encode general musical features of an audio clip with a VAE (such as pitch, mood, tempo, the tone of the vocalist, etc.). I’m not sure whether that is even possible, or whether this is way too much for a master’s thesis, especially since I also have to design the downstream model afterwards. So please tell me if what I am planning here is unrealistic.

My Confusion: Training on Short vs. Long Audio

I understand the VAE theory, but I’m unclear on the practical implementation for full-length songs (e.g., 3 minutes):

What I’ve read:

  • Some papers (EnCodec) use 1-second chunks with 10ms overlap during encoding

  • Other papers (Stable Audio VAE) process longer sequences (47 seconds) in one forward pass

My questions:

  1. For VAE training: Should I train on 10-second clips and later encode full 3-minute songs with the trained model? Or must I train on full-length songs from the start?

  2. Chunking strategy: If I train my VAE on 10-second clips, can I still train my downstream generative model on the full 3-minute latent sequences? (Encode full songs with trained VAE → use those latents for generation training)

  3. Style extraction: I want to distill the musical features mentioned earlier into a style vector (a rough unsupervised sketch follows after this list). Do I need:

    • Supervised labels (requires specific dataset)

    • Or can this be learned unsupervised?
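For question 3, one unsupervised option I could imagine (just a sketch in the spirit of reference/style encoders, not something I know to be the right approach; all names and sizes are my own placeholders) is to pool a reference clip’s features into a single global vector and let the generator condition on it. Whether that vector actually captures pitch, mood, or tempo would depend entirely on how the downstream model is trained to use it, and no labels are required:

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Unsupervised style encoder sketch: mel-spectrogram -> one global style vector.
    The vector only becomes meaningful through how the generator is trained to use it."""
    def __init__(self, n_mels=80, style_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.gru = nn.GRU(256, style_dim, batch_first=True)

    def forward(self, mel):                    # mel: (B, n_mels, T)
        h = self.conv(mel).transpose(1, 2)     # (B, T', 256)
        _, last_hidden = self.gru(h)           # final hidden state summarizes the clip
        return last_hidden.squeeze(0)          # (B, style_dim) global style vector

mel = torch.randn(4, 80, 400)                  # dummy reference clips
style = StyleEncoder()(mel)
print(style.shape)                             # torch.Size([4, 128])
```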

What I Think I Understand (Please Correct Me!)

  • VAE training clips ≠ generative model training sequence length

  • We usually cannot encode a full-length song (3 minutes) at once. There will always be chunking, be it 1-second, 10-second, or 47-second chunks, depending on the architecture (a chunked-encoding sketch follows below)
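To illustrate what I mean by the second point (a sketch that assumes a roughly fully convolutional encoder with a fixed hop size; the function and names are made up): encode the full song in consecutive windows with the VAE trained on short clips, then concatenate the latents along time and use that sequence to train the generative model.

```python
import torch

def encode_long_audio(encode_fn, wav, sr=44100, chunk_sec=10.0, hop_sec=10.0):
    """Encode a long waveform chunk by chunk with a VAE trained on short clips.

    encode_fn : maps a (1, 1, T_chunk) waveform to a (1, C, T_latent) latent tensor
    wav       : (1, 1, T_total) full-length waveform
    Assumes a (near) fully convolutional encoder, so chunk latents can simply be
    concatenated along the time axis.
    """
    chunk_len = int(chunk_sec * sr)
    hop_len = int(hop_sec * sr)
    latents = []
    for start in range(0, wav.shape[-1] - chunk_len + 1, hop_len):
        chunk = wav[..., start:start + chunk_len]
        with torch.no_grad():
            latents.append(encode_fn(chunk))
    return torch.cat(latents, dim=-1)          # (1, C, T_latent_total)

# Dummy stand-in for a trained VAE encoder (downsamples by 512x, 64 latent channels)
fake_encoder = torch.nn.Conv1d(1, 64, kernel_size=1024, stride=512)
song = torch.randn(1, 1, 44100 * 180)          # ~3-minute waveform
z_full = encode_long_audio(lambda x: fake_encoder(x), song)
print(z_full.shape)                            # full-song latent sequence for generator training
```

With overlapping windows you would cross-fade or trim the latent boundaries instead of doing a plain concatenation.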

I’d really appreciate guidance from anyone who’s implemented audio VAEs or other audio codecs before, especially for music generation. Am I on the right track?
Also, if you are interested in writing on this topic in general, I would gladly get in touch with you - this is my Discord:
jodas.

Thanks in advance!


I’ll just leave the resources here for now. :sweat_smile:

For questions where science and ML/AI intersect, I think it’s more reliable to ask on the Hugging Science Discord (a separate server from the Hugging Face Discord). It’s relatively new, but it has a lot of knowledgeable people.


Dude, what? I had almost forgotten about this post and only found it because I was skimming through the spam folder of my email and saw the HF notification. You’re a legend, thanks for taking the time with this!
