Faster Training with Self-play Models

Lately, there has been a surge of research interest in self-play and GAN-style models in the LLM space. Specifically, I mean models that don't rely on purely offline, supervised-style training like DPO, but instead generate samples from the current policy on the fly during training (think GANs, or on-policy RL methods like PPO). So my question is: what is the best distributed GPU framework for training models like these? I don't think Accelerate or FSDP works very well here, since neither is designed around interleaving sampling with training updates (correct me if I'm wrong). Are there any other training frameworks that would be a good fit?
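For concreteness, the workload I'm describing alternates between generating from the current model and updating on those generations, rather than iterating over a fixed dataset. Here is a minimal pure-Python sketch of that loop shape (a toy REINFORCE update on a hypothetical two-armed bandit, just to illustrate the sample-then-update pattern, not any specific library's API):

```python
import math
import random

random.seed(0)

# Hypothetical toy environment: arm 1 pays 1.0, arm 0 pays 0.2.
REWARDS = [0.2, 1.0]

# Single policy parameter: the logit for picking arm 1.
theta = 0.0
lr = 0.5

def p_arm1(t):
    """Probability of choosing arm 1 under a sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-t))

for step in range(500):
    # 1) Sample from the *current* policy -- this is the on-the-fly
    #    generation step that offline/supervised training doesn't have.
    p = p_arm1(theta)
    action = 1 if random.random() < p else 0
    reward = REWARDS[action]

    # 2) REINFORCE update: theta += lr * reward * d/dtheta log pi(action).
    grad_logp = (1.0 - p) if action == 1 else -p
    theta += lr * reward * grad_logp

# After training, the policy should strongly prefer the better arm.
```

In an LLM self-play setting, step 1 becomes autoregressive generation (often the dominant cost), which is why frameworks built purely around forward/backward passes over static batches can be an awkward fit.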