Improve a DistilBERT Question Answering model with reinforcement learning


I’m using a pre-trained model (distilbert-base-cased-distilled-squad) for Question Answering, and I’m looking for a way to improve it using user feedback as rewards and penalties that indicate how well the model answered the question in a given context. I’ve found the Transformer Reinforcement Learning (TRL) library, which is built on top of the transformers library by Hugging Face and can be used to train transformer language models with Proximal Policy Optimization (PPO). But it is only implemented for decoder architectures such as GPT-2. Is there a workaround that would let me apply a similar reinforcement-based approach to a pre-trained DistilBERT model (using reward and penalty scores for question and answer pairs), or is there any other possible solution?

Thank you.

If you haven’t already, read this paper; it’ll give you everything you need.

Once you determine what your policy \rho(y|x) is (here x would be the question plus context and y the answer span), the rest is just filling in the pieces. It may even be simpler than with a decoder model, because each episode is a single action, i.e. selecting one answer span.
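To make the "single action" idea a bit more concrete, here is a minimal sketch of one way the policy \rho(y|x) could be defined over spans and updated with plain REINFORCE. It is not taken from the paper or from TRL. It assumes the policy factorises as "sample a start token, then sample an end token at or after it", that the reward is a scalar from user feedback, and it works on plain lists of logits so it runs without any dependencies. The names `sample_span` and `reinforce_grads` are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_span(start_logits, end_logits, rng):
    """Sample one (start, end) span with end >= start.

    Assumed policy: rho(span | x) = p_start(start) * p_end(end | end >= start).
    Returns the span and its log-probability under that policy.
    """
    p_start = softmax(start_logits)
    start = rng.choices(range(len(p_start)), weights=p_start)[0]
    # Only positions at or after the sampled start are valid end positions.
    p_end = softmax(end_logits[start:])
    offset = rng.choices(range(len(p_end)), weights=p_end)[0]
    log_prob = math.log(p_start[start]) + math.log(p_end[offset])
    return (start, start + offset), log_prob


def reinforce_grads(start_logits, end_logits, span, reward, baseline=0.0):
    """Gradient of the REINFORCE loss -(reward - baseline) * log rho(span | x)
    with respect to the start and end logits."""
    start, end = span
    advantage = reward - baseline
    p_start = softmax(start_logits)
    p_end = softmax(end_logits[start:])
    # d/dz log softmax(z)[k] = onehot(k) - softmax(z)
    g_start = [-advantage * ((1.0 if i == start else 0.0) - p)
               for i, p in enumerate(p_start)]
    # End logits before `start` were masked out, so their gradient is zero.
    g_end = [0.0] * start + [
        -advantage * ((1.0 if start + j == end else 0.0) - p)
        for j, p in enumerate(p_end)
    ]
    return g_start, g_end
```

With a real model, the two logit lists would be the `start_logits` and `end_logits` that the question answering head returns for one example. You would then build `loss = -(reward - baseline) * log_prob` from tensors and let autograd do what `reinforce_grads` spells out by hand. A positive reward pushes the logits of the sampled span up, and a negative one pushes them down. The baseline (for example a running mean of past rewards) is optional and only there to reduce variance.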

Thanks for the help, appreciate it.

That was really good.