I’m fine-tuning a T5 model on CoQA, a conversational question-answering dataset in which each story (context) comes with a sequence of question-answer pairs. I trained the model on inputs consisting of the story, the question at turn t, and the Q&A pairs from turns 1 to t-1, which is a fairly conventional setup for conversational QA. I used the same input format at inference time.
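For concreteness, here is a minimal sketch of the kind of input string I mean. The `build_input` name and the `question:` / `answer:` / `context:` prefixes are illustrative assumptions, not a format prescribed by T5 or CoQA, and not necessarily the exact serialization used in training.

```python
from typing import List, Tuple


def build_input(story: str, history: List[Tuple[str, str]], question: str) -> str:
    """Serialize one conversational turn into a single input string.

    `history` holds the (question, answer) pairs from turns 1 to t-1.
    The text prefixes below are assumptions chosen for illustration.
    """
    parts = [f"question: {q} answer: {a}" for q, a in history]
    parts.append(f"question: {question}")
    parts.append(f"context: {story}")
    return " ".join(parts)


example = build_input(
    story="Anna has a red bike.",
    history=[("Who has a bike?", "Anna")],
    question="What color is it?",
)
print(example)
```

During training the answers in `history` are the ground-truth ones; the open question is what to put there at evaluation time.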
However, I don’t think it is fair to feed the ground-truth answers from turns 1 to t-1 when generating the answer to the question at turn t, since that could be considered cheating. A fairer evaluation would run inference sequentially from turn 1 to turn t, using the answers the model itself generated at earlier turns as the history. My question is: how can I run this evaluation with a batch size larger than 1?
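One idea, as a rough sketch rather than a tested solution: batch across conversations instead of across turns. At step t, collect the t-th question of every conversation that still has one, build each input from that conversation's own predicted history, and run one batched generation. Everything named here (`format_turn`, `evaluate_in_lockstep`, the dict layout, and `generate_fn`) is hypothetical. With Hugging Face Transformers, `generate_fn` would presumably wrap the tokenizer call with padding, `model.generate`, and `tokenizer.batch_decode`; a toy stand-in is used below so the sketch runs on its own.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def format_turn(story: str, history: Iterable[Tuple[str, str]], question: str) -> str:
    # Assumed serialization; replace with whatever format the model was trained on.
    parts = [f"question: {q} answer: {a}" for q, a in history]
    parts += [f"question: {question}", f"context: {story}"]
    return " ".join(parts)


def evaluate_in_lockstep(
    conversations: List[Dict],
    generate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[List[str]]:
    """Answer all conversations turn by turn, feeding back predicted answers."""
    predictions: List[List[str]] = [[] for _ in conversations]
    max_turns = max((len(c["questions"]) for c in conversations), default=0)

    for t in range(max_turns):
        # Indices of conversations that still have a question at turn t.
        active = [i for i, c in enumerate(conversations) if len(c["questions"]) > t]
        for start in range(0, len(active), batch_size):
            chunk = active[start:start + batch_size]
            inputs = []
            for i in chunk:
                conv = conversations[i]
                # History pairs earlier questions with the model's own answers.
                history = zip(conv["questions"][:t], predictions[i])
                inputs.append(format_turn(conv["story"], history, conv["questions"][t]))
            outputs = generate_fn(inputs)
            for i, answer in zip(chunk, outputs):
                predictions[i].append(answer)
    return predictions


def toy_generate(inputs: List[str]) -> List[str]:
    # Stand-in for a real model: reports how many history answers it saw.
    return [f"a{text.count('answer:')}" for text in inputs]


conversations = [
    {"story": "Story one.", "questions": ["q1", "q2"]},
    {"story": "Story two.", "questions": ["q1", "q2", "q3"]},
]
preds = evaluate_in_lockstep(conversations, toy_generate, batch_size=2)
print(preds)
```

The number of sequential steps is the length of the longest conversation, and batches shrink as shorter conversations finish, so grouping conversations of similar length may help keep batches full.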
If anyone has thought about a similar problem, please share!