Diffusion models for environmental sound generation

I would like to generate environmental sounds from text, or from even simpler numerical values, using a diffusion model along the lines of Stable Diffusion. Does anyone have research suggestions for me? The idea is to generate a sound scene such as “rain with a very strong wind”, or simply to modulate a single parameter, for example the intensity of the rain.
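To make the "numerical values" part concrete, here is a rough sketch of what I mean. It only builds a conditioning vector from a few scalar controls, using the same kind of sinusoidal embedding that diffusion models commonly use for the timestep. The control names (`rain_intensity`, `wind_strength`), the function names, and the embedding size are all made up for illustration, and how such a vector would actually be fed into a denoising network is exactly the part I am unsure about.

```python
import math

def scalar_embedding(value, dim=8, max_period=10000.0):
    """Sinusoidal embedding of one scalar: dim/2 sines followed by dim/2 cosines."""
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down towards 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]

def conditioning_vector(controls, dim=8):
    """Concatenate the embeddings of all controls, in sorted-name order."""
    vec = []
    for name in sorted(controls):
        vec.extend(scalar_embedding(controls[name], dim))
    return vec

# Hypothetical controls, each assumed to be normalised to [0, 1].
cond = conditioning_vector({"rain_intensity": 0.8, "wind_strength": 0.3})
print(len(cond))
```

With two controls and `dim=8` this gives a 16-dimensional vector. In a real model the raw values would probably need rescaling before embedding so that nearby intensities map to clearly different vectors, but that is an assumption on my part.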
Thanks in advance for any ideas or advice.