I had the idea of creating a service that converts a song in one genre into a song with the same lyrics but a different genre and instrumentation.
I was wondering whether it would make more sense to first use an audio-to-text model to extract the lyrics and then use one of the text-to-music models to create the song in the new genre, or whether I should attempt to train a new audio-to-audio model that does both tasks at once.
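
To make the first option concrete, here is a rough sketch of how the two-stage pipeline might be wired together. Everything in it is an assumption for illustration: `GenreConverter`, `Transcriber`, `Generator`, `fake_transcribe`, and `fake_generate` are hypothetical names, and the two stand-in functions do not call any real model. A real audio-to-text model and a real text-to-music model would be wrapped to match these signatures.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real models would be wrapped to fit them.
Transcriber = Callable[[bytes], str]      # audio bytes -> lyrics text
Generator = Callable[[str, str], bytes]   # (lyrics, target genre) -> audio bytes


@dataclass
class GenreConverter:
    """Two-stage pipeline: transcribe the lyrics, then generate new audio."""
    transcribe: Transcriber
    generate: Generator

    def convert(self, audio: bytes, target_genre: str) -> bytes:
        # Stage 1: recover the lyrics from the source recording.
        lyrics = self.transcribe(audio)
        # Stage 2: generate a new song from the lyrics and the target genre.
        return self.generate(lyrics, target_genre)


# Stand-ins so the sketch runs without any model (assumptions, not real models).
def fake_transcribe(audio: bytes) -> str:
    # Pretends the audio bytes are simply UTF-8 encoded lyrics.
    return audio.decode("utf-8")


def fake_generate(lyrics: str, genre: str) -> bytes:
    # Pretends to render audio by tagging the lyrics with the genre.
    return f"[{genre}] {lyrics}".encode("utf-8")


converter = GenreConverter(fake_transcribe, fake_generate)
result = converter.convert(b"hello world", "jazz")
```

One thing the sketch makes visible is that only the lyrics text crosses from the first stage to the second, so the melody and timing of the original song are not passed along unless the interface between the stages is widened.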