How do I train a text-to-sing model?

I am wondering what the difference is between a text-to-speech model and a text-to-sing model. Is text-to-sing a modified or more advanced version of a text-to-speech model?

How much data do I need to train a model that can sing any text prompt in one specific voice, but at several different rhythms and speeds?