Hi,
I’m using a pre-trained model (distilbert-base-cased-distilled-squad) for Question Answering, and I’m looking for a way to improve the model using user feedback as rewards and penalties that indicate how well the model answered a question in a given context.

I found the Transformer Reinforcement Learning (TRL) library, which is built on top of the Hugging Face Transformers library and can be used to train transformer language models with Proximal Policy Optimization (PPO). However, it is only implemented for decoder architectures such as GPT-2. I’m wondering if there is a workaround to apply a similar approach to improving the pre-trained DistilBERT model with a reinforcement-based method (using reward and penalty scores for question–answer pairs), or any other possible solution for this?
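For reference, this is roughly how I’m using the model today. The feedback_score and training_example here are just placeholders for the user ratings I’d like to learn from, not something I already have wired up:

```python
from transformers import pipeline

# Current setup: extractive QA with the pre-trained checkpoint
qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

context = "The Eiffel Tower is located in Paris and was completed in 1889."
question = "When was the Eiffel Tower completed?"

prediction = qa(question=question, context=context)
print(prediction["answer"], prediction["score"])

# What I'd like to train on: a user rating of the predicted answer.
# (Placeholder value; in my application this would come from real user feedback.)
feedback_score = 1.0  # e.g. +1 for a good answer, -1 for a bad one
training_example = {
    "question": question,
    "context": context,
    "predicted_answer": prediction["answer"],
    "reward": feedback_score,
}
```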
Thank you.