Is there a model that turns the vocals in a song into text?

Some time ago I had an idea for a model that extracts the vocals from a song and then returns the lyrics to the user.
The Split Audio Tracks to MusicGen tool extracts the vocals pretty well, but I can’t save the extracted audio. If I could, I could feed it to another tool to transcribe the vocals into text.

So my idea is to join these two processes together in one tool. The user uploads the file, or provides a URL to the song, and the model does its job. Does anyone know if such a tool already exists?
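
To make the idea concrete, here is a rough sketch of the two-step pipeline I have in mind. It is only an illustration under some assumptions: it uses Demucs for the vocal separation and OpenAI's Whisper for the transcription, which are just examples of tools that could fill those two roles, not the ones mentioned above. The function names (`separate_vocals`, `transcribe`, `song_to_lyrics`) are made up for this example, and downloading the song from a URL is left out.

```python
import subprocess
import sys
from pathlib import Path


def separate_vocals(song: Path, out_dir: Path, model: str = "htdemucs") -> Path:
    """Split the song with the Demucs CLI and return the path of the vocals file.

    Assumes the `demucs` package is installed in the current environment.
    """
    subprocess.run(
        [sys.executable, "-m", "demucs", "--two-stems=vocals",
         "-n", model, "-o", str(out_dir), str(song)],
        check=True,
    )
    # Demucs writes its output to <out_dir>/<model>/<track name>/vocals.wav
    return out_dir / model / song.stem / "vocals.wav"


def transcribe(vocals: Path, model_size: str = "base") -> str:
    """Transcribe the isolated vocals with Whisper and return plain text.

    Assumes the `openai-whisper` package is installed; it is imported lazily
    so the rest of the sketch can be used without it.
    """
    import whisper

    model = whisper.load_model(model_size)
    return model.transcribe(str(vocals))["text"].strip()


def song_to_lyrics(song, out_dir="separated",
                   separate=separate_vocals, transcriber=transcribe) -> str:
    """Chain the two steps: song file -> vocals file -> lyrics text.

    Both steps are passed in as parameters so either one can be swapped
    for a different tool.
    """
    vocals = separate(Path(song), Path(out_dir))
    return transcriber(vocals)
```

With both packages installed, calling `song_to_lyrics("song.mp3")` would run the whole chain on a local file. I would expect the transcription quality on sung vocals to be noticeably worse than on normal speech, so this is a starting point rather than a finished tool.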

