Human Evaluation and Statistical significance

manzar · April 8, 2021, 8:21pm

Hello,
I recently ran a human evaluation of a chatbot via a survey, and I would like to know how to show that the results are statistically significant.
More specifically, I compared the responses generated by two chatbots and calculated each one’s win rate. In addition, participants rated each model on “relevance” and “fluency” on a scale from 1 to 5.
Some references (e.g. the DodecaDialogue paper) show that their results are statistically significant using a binomial test.
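For concreteness, here is a minimal sketch of how I understand a binomial test would apply to the win counts. It assumes each pairwise comparison is an independent trial, that ties are dropped, and that the null hypothesis is a 50% win rate for either chatbot. The function name `binomial_two_sided_p` and the counts below are made up for illustration, not taken from my survey or from the paper. It does not cover the 1 to 5 ratings.

```python
from math import comb


def binomial_two_sided_p(wins: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial test p-value for `wins` successes in `n` trials.

    Null hypothesis: the true win probability equals `p0`.
    """
    # Probability of every possible number of wins under the null hypothesis.
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Sum the probabilities of all outcomes that are no more likely than the
    # observed one (small relative tolerance guards against float rounding).
    total = sum(p for p in pmf if p <= observed * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical counts: chatbot A preferred 60 times, chatbot B 40 times.
wins_a, wins_b = 60, 40
p_value = binomial_two_sided_p(wins_a, wins_a + wins_b)
print(f"win rate A = {wins_a / (wins_a + wins_b):.2f}, p-value = {p_value:.4f}")
```

If SciPy is available, I believe `scipy.stats.binomtest(wins_a, wins_a + wins_b, p=0.5)` computes the same kind of test, and a p-value below the chosen threshold (commonly 0.05) would be reported as significant.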

How can I apply binomial testing in the aforementioned case?

@patrickvonplaten
