Since it is a very small model, hardware with around 2GB of VRAM should be enough, so it should run on almost any GPU rental service…
The real question is how you will generate the audio data (GUI? CLI? Your own script?) and where you will store it over such a long period.
For example, if you are happy generating through a Hugging Face Space GUI and keeping the results on your own hard disk, the following would be the cheapest option among plans with no time limit.
If you want to save your data online, you can use a private model repository or dataset repository on Hugging Face, which should be able to hold up to about 100GB.
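If you go the online storage route, here is a rough sketch of how pushing a folder of generated audio to a private dataset repository might look with the `huggingface_hub` library. The helper names `folder_summary` and `upload_audio`, the 100GB budget constant, and any folder or repo names you pass in are my own placeholders, not something from your setup, so adjust them as needed. It assumes you have already logged in with a write token.

```python
from pathlib import Path

# Rough storage budget taken from the ~100GB figure above (an assumption, not a hard limit).
BUDGET_BYTES = 100 * 1000**3


def folder_summary(folder):
    """Return (file_count, total_bytes) for every file under `folder`."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


def upload_audio(folder, repo_id):
    """Upload everything in `folder` to a private dataset repo.

    `repo_id` is a placeholder such as "your-username/your-audio-data".
    Needs network access and a saved Hugging Face token.
    """
    # Imported here so the size check above works without the library installed.
    from huggingface_hub import HfApi

    count, total = folder_summary(folder)
    if total > BUDGET_BYTES:
        raise ValueError(f"{total} bytes in {count} files exceeds the assumed budget")

    api = HfApi()
    # Create the private dataset repo if it does not exist yet.
    api.create_repo(repo_id, repo_type="dataset", private=True, exist_ok=True)
    # Upload the whole folder in one call.
    api.upload_folder(folder_path=str(folder), repo_id=repo_id, repo_type="dataset")
```

You would call it as `upload_audio("generated_audio", "your-username/your-audio-data")` after each batch. Checking the folder size first is just a cheap way to notice when you are getting close to the limit.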