Fine-tuning GPT-Neo via PPO

I have a wild idea to improve smaller GPT-3-esque models by tuning their output with PPO, a reinforcement learning algorithm. This was originally done to align GPT-2's output with human preferences: https://arxiv.org/pdf/1909.08593.pdf
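As a rough sketch of the mechanics in that paper (toy numbers, not a real implementation): the preference reward is shaped with a KL penalty against the pretrained reference model, and the policy is updated with PPO's clipped surrogate objective. The function names and the beta/epsilon values here are illustrative, not taken from any library.

```python
import math

def shaped_reward(pref_reward, logp_policy, logp_ref, beta=0.1):
    # Ziegler et al. subtract a KL penalty from the preference reward
    # so the tuned policy stays close to the pretrained model.
    return pref_reward - beta * (logp_policy - logp_ref)

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    # PPO's clipped surrogate: the ratio of new to old policy probability,
    # clipped to [1 - eps, 1 + eps], keeps each update conservative.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

# If the new policy doubles a sample's probability, the clip caps
# the objective at (1 + eps) * advantage.
print(ppo_clipped_objective(math.log(2), 0.0, 1.0))  # -> 1.2
```

In practice libraries wrap this per-token over whole sampled completions, but the per-sample arithmetic is just this.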

I propose to fine-tune GPT-Neo directly on "prompt-driven" data. Most obviously, higher-performing models could teach lower-performing models by providing examples for the smaller models to learn from.

However, I wonder if it is possible to fine-tune the model in a narrower domain, e.g. code completion like Copilot. Would proof writing not be the ideal test? With many proofs publicly available, it could provide easily accessible data with more definitive evaluation than conversational quality; i.e., we might compare a naive proof to a fine-tuned proof of the same problem. I am aware that human evaluation is still required.

Other prompt-driven data likely exists (essays, etc.). However, the technical dream is to compress model performance by fine-tuning with PPO on examples sourced from larger/higher-performance models. Perhaps we could then pull robust narrow capabilities from larger models into smaller models without distilling the entire teacher model's knowledge.
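One way this could work, sketched with toy numbers: score each student completion by its likelihood under the teacher and use that as the PPO reward, so the student is only pushed toward teacher-like behavior on the narrow prompts we care about. The unigram "teacher" below is a stand-in for real language-model log-probs; the vocabulary and function name are invented for illustration.

```python
import math

# Hypothetical teacher distribution over a tiny proof-flavored vocabulary;
# in the real setup these would be per-token log-probs from the large model.
teacher_logprobs = {"lemma": math.log(0.4), "hence": math.log(0.3),
                    "qed": math.log(0.2), "banana": math.log(0.1)}

def teacher_reward(tokens):
    # Average per-token log-probability under the teacher: higher means
    # the completion looks more like something the teacher would write.
    return sum(teacher_logprobs[t] for t in tokens) / len(tokens)

good = teacher_reward(["lemma", "hence", "qed"])
bad = teacher_reward(["banana", "banana", "qed"])
assert good > bad  # PPO would push the student toward the first completion
```

The appeal is that the teacher only needs to be run for scoring, not imitated wholesale, which is what distinguishes this from full distillation.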

Is this a good idea to try? And is the model simply too big for this to be practical, i.e., would it raise DeepSpeed questions?

Best,
Aidan