How does the Riffusion model generate vocals in music?

I am wondering how the Riffusion model turns a text prompt into a singer’s voice and adds background music to it. I understand how it generates the music itself, but not how it generates the singing voice and integrates it with the instrumental. Does it use a text-to-speech engine somewhere? And how does it match the vocal speed/rhythm to the generated music?
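For context, here is my (possibly wrong) mental model of the part I do understand: Riffusion generates a spectrogram image and then inverts that magnitude spectrogram back to a waveform, estimating the missing phase with something like the Griffin–Lim algorithm. A minimal sketch of that inversion step using SciPy (the function name, window size, and signal length below are my own choices for illustration, not Riffusion’s actual settings):

```python
import numpy as np
from scipy.signal import stft, istft

NPERSEG = 512   # STFT window size (my choice, not Riffusion's)
LENGTH = 16384  # fixed signal length so STFT/ISTFT shapes round-trip exactly

def griffin_lim(mag, n_iter=32):
    """Recover a waveform from a magnitude spectrogram by iteratively
    re-estimating the phase (Griffin-Lim algorithm)."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=NPERSEG)      # back to time domain
        _, _, Z = stft(x[:LENGTH], nperseg=NPERSEG)     # re-analyze
        phase = np.exp(1j * np.angle(Z))                # keep phase, reset magnitude
    _, x = istft(mag * phase, nperseg=NPERSEG)
    return x[:LENGTH]

# Stand-in for a "generated spectrogram": the magnitude STFT of a 440 Hz tone.
sr = 22050
t = np.arange(LENGTH) / sr
tone = np.sin(2 * np.pi * 440 * t)
_, _, Z = stft(tone, nperseg=NPERSEG)
audio = griffin_lim(np.abs(Z))
print(audio.shape)  # (16384,) -- same length as the original signal
```

My confusion is that this pipeline only explains instrumental texture; I don’t see where intelligible lyrics with a melody would come from in a spectrogram-diffusion setup like this.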