I recently ran a human evaluation of a chatbot via a survey, and I would like to know how to test whether the results are statistically significant.
More specifically, I compared the generated responses of two chatbots and calculated each one’s win rate. In addition, participants were asked to rate each model for “relevance” and “fluency” on a scale from 1 to 5.
Some references (e.g. the DodecaDialogue paper) show that their results are statistically significant using a binomial test.
How can I apply a binomial test in my case?
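
For concreteness, here is a minimal sketch of what I imagine the test on the win rate would look like. It assumes that ties are discarded, that each pairwise comparison is an independent trial, and that the null hypothesis is "both chatbots are equally likely to win" (p = 0.5). The function name `binomial_test_two_sided` and the counts are placeholders I made up for illustration, not my actual survey results.

```python
from math import comb


def binomial_test_two_sided(wins: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test.

    Returns the probability, under the null hypothesis that each trial is
    won with probability p, of seeing an outcome at most as likely as the
    observed number of wins.
    """
    if not 0 <= wins <= n:
        raise ValueError("wins must be between 0 and n")
    # Probability of every possible number of wins under the null.
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Sum all outcomes that are no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    p_value = sum(q for q in pmf if q <= observed * (1 + 1e-7))
    return min(1.0, p_value)


# Placeholder counts (hypothetical): chatbot A preferred in 60 of 100
# non-tied comparisons, chatbot B in the remaining 40.
wins_a, n_comparisons = 60, 100
p_value = binomial_test_two_sided(wins_a, n_comparisons)
print(f"win rate A = {wins_a / n_comparisons:.2f}, p-value = {p_value:.4f}")
```

If the p-value is below the chosen threshold (commonly 0.05), I would read that as the win rate differing significantly from 50%. I am less sure about the 1-to-5 ratings: my guess is that the same function could serve as a sign test, by counting how many participants rated one model higher than the other and dropping equal ratings, but I would appreciate confirmation.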