Training generative models based on "rewards"

Suppose we want to train BART/T5. Typically these models are trained assuming direct access to gold outputs. I am interested in a slightly different setting: suppose you don’t have the gold output, but you do have access to a black box (a reward function) that tells you how “correct” the current generation is. Does anyone have thoughts on how this could be done?
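One common family of approaches here is policy-gradient RL (e.g. REINFORCE): sample a generation from the model, score it with the black-box reward, and update the model by scaling the sequence log-probability by the (baseline-subtracted) reward. Below is a minimal, self-contained sketch of that idea; the tiny GRU decoder, the `reward_fn`, and all hyperparameters are hypothetical stand-ins so the example runs without downloading a real BART/T5 checkpoint — the update rule itself carries over unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB, MAX_LEN, HID = 10, 5, 16

class TinyDecoder(nn.Module):
    """Toy stand-in for a seq2seq decoder (a real setup would use BART/T5)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, HID)  # +1 for a BOS token
        self.rnn = nn.GRUCell(HID, HID)
        self.head = nn.Linear(HID, VOCAB)

    def forward(self, h, tok):
        h = self.rnn(self.embed(tok), h)
        return h, self.head(h)  # hidden state, next-token logits

def reward_fn(tokens):
    # Hypothetical black-box reward: fraction of even tokens in the output.
    # In practice this is whatever scorer tells you how "correct" the text is.
    return sum(1.0 for t in tokens if t % 2 == 0) / len(tokens)

model = TinyDecoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
baseline = 0.0  # running-mean baseline to reduce gradient variance

for step in range(200):
    # 1) Sample a generation from the current policy, keeping log-probs.
    h = torch.zeros(1, HID)
    tok = torch.tensor([VOCAB])  # BOS
    log_probs, tokens = [], []
    for _ in range(MAX_LEN):
        h, logits = model(h, tok)
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        tokens.append(tok.item())

    # 2) Score the sampled sequence with the black-box reward.
    R = reward_fn(tokens)
    baseline = 0.9 * baseline + 0.1 * R

    # 3) REINFORCE update: raise the log-prob of samples that beat the
    #    baseline, lower it for samples that fall below it.
    loss = -(R - baseline) * torch.stack(log_probs).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"reward of last sample: {reward_fn(tokens):.2f}")
```

Variance is the main practical obstacle: a single scalar reward per sequence makes gradients noisy, which is why people typically add a baseline (as above), sample multiple generations per input, or fall back on self-critical training / minimum-risk training variants of the same idea.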