(Audio-to-audio models) Should I use two models sequentially or train one model to build a music-to-music converter?

I had the idea of creating a service that takes a song in one genre and transforms it into a song with the same lyrics but a different genre and instrumental.

I was wondering whether it would make more sense to first use an audio-to-text model to get the lyrics and then feed them into one of the text-to-music models to create the song in the new genre, or whether I should attempt to train a new audio-to-audio model that does both tasks at once.
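
For concreteness, here's a rough sketch of what I mean by the two-model approach. Whisper and MusicGen are just placeholder choices, and the file names and genre prompt are made up; MusicGen in particular only generates instrumental audio, so the real pipeline would need a text-to-music model that can actually sing the transcribed lyrics.

```python
# Sketch of the two-model pipeline: transcribe the lyrics, then prompt a
# text-to-music model with the lyrics plus a target-genre description.
# Model choices (Whisper, MusicGen) and file names are placeholders.
import whisper
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Step 1: audio -> text (lyric transcription)
asr = whisper.load_model("base")
lyrics = asr.transcribe("original_song.mp3")["text"]

# Step 2: text -> music, steered toward the target genre
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
musicgen = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

prompt = f"a country ballad with these lyrics: {lyrics}"
inputs = processor(text=[prompt], padding=True, return_tensors="pt")
audio = musicgen.generate(**inputs, max_new_tokens=512)

# Write the generated waveform to disk
rate = musicgen.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("converted_song.wav", rate=rate, data=audio[0, 0].numpy())
```

The alternative would be a single audio-to-audio model trained end to end on (original song, genre-converted song) pairs, which skips the intermediate lyrics step entirely but needs paired training data I don't currently have.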