Currently, we have a use case where we use an LLM (currently testing Llama, GPT, and Falcon) to ask multiple questions about a set of conversations, each tied to a conversation ID. We then use the answers to all the questions to generate a cohesive summary for that conversation ID, refining it through an LLM.
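For concreteness, the flow described above can be sketched roughly as follows. This is a simplified illustration, not our actual code: `summarize_conversation`, the prompt wording, and the `llm` callable (any function that takes a prompt string and returns text) are all placeholder assumptions.

```python
from typing import Callable, Dict, List


def summarize_conversation(
    conversation: str,
    questions: List[str],
    llm: Callable[[str], str],  # hypothetical: wraps whichever model is under test
) -> Dict[str, object]:
    """Ask each question about one conversation, then refine the answers into a summary."""
    answers: Dict[str, str] = {}
    for question in questions:
        # One LLM call per question (prompt format is an assumption).
        prompt = f"Conversation:\n{conversation}\n\nQuestion: {question}\nAnswer:"
        answers[question] = llm(prompt)

    # One final LLM call that turns the per-question answers into a summary.
    joined = "\n".join(f"- {q}: {a}" for q, a in answers.items())
    summary = llm(f"Write a cohesive summary from these answers:\n{joined}")
    return {"answers": answers, "summary": summary}
```

Keeping the per-question answers next to the summary means they can be reused later as evidence when scoring the summary.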
How can we evaluate the summaries while keeping the cost low and without depending on human evaluation?
How can we make this approach better, architecturally or programmatically?
We will be using multiple models and testing them with benchmarks.
We are currently using BERTScore, ROUGE, and BLEU, and exploring G-Eval.
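One cheap, reference-free check that might fit here is to treat each per-question answer as a pseudo-reference and measure how much of it survives in the final summary. The sketch below is a minimal, stdlib-only approximation of ROUGE-1 recall (plain unigram overlap, no stemming or stopword handling), so its numbers will not match an official ROUGE implementation; the function names are made up for illustration.

```python
import re
from collections import Counter
from typing import Dict


def _tokens(text: str) -> Counter:
    # Lowercased alphanumeric unigrams with counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams (with multiplicity) that also appear in candidate."""
    ref, cand = _tokens(reference), _tokens(candidate)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((ref & cand).values())  # clipped counts, as in ROUGE-1
    return overlap / total


def answer_coverage(answers: Dict[str, str], summary: str) -> Dict[str, float]:
    """Score, per question, how much of that question's answer the summary retains."""
    return {q: rouge1_recall(a, summary) for q, a in answers.items()}
```

A low score for a question suggests the summary dropped or heavily paraphrased that answer, so only those cases would need a more expensive check such as BERTScore or G-Eval.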