I have two models (`model1`, `model2`) that are run on a test dataset of `5,800` instances.

So I show each data point to each model, and they generate a response.

If the response from the model matches the expected outcome, I count it as “1”; otherwise it is “0”. So it's a `binary` outcome.

- Accuracy from `model1` is 68%.
- Accuracy from `model2` is 75%.

But I was told that the improvement with `model2` compared to `model1` is not statistically significant.

I was further asked to report a confidence interval from a two-sample difference-in-proportions test.

Also what is the statistical significance of my result?

I don’t understand this query.

- I ran inference with all the data points, and it is not like I only tested for a proportion of the test dataset.

Please note I have read about confidence intervals and the paired samples t-test on different blogs and Khan Academy, and watched many YouTube videos.

But I can't figure this out. All the videos apply these techniques to numeric values, not to a binary metric like the one above.

As an example, in this tutorial, Paired Samples T-Test (How to calculate and interpret) - YouTube, they applied this to weight loss. As a result, they were able to get a `mean` and a `standard deviation`. But in my case it's a binary outcome, i.e. `0` or `1`.

So how do I calculate statistical significance from the accuracies of the two models?
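
To make the question concrete, here is a minimal sketch of what I think a two-sample difference-in-proportions calculation looks like for my numbers. The function name `two_proportion_summary` is my own, and the correct-answer counts (3,944 and 4,350) are just 68% and 75% of 5,800. It assumes the two sets of predictions are independent samples, which is probably not quite right here because both models see the same 5,800 instances; a paired test such as McNemar's test may be more appropriate, but that needs the per-instance agreement counts rather than only the two accuracies.

```python
import math

def two_proportion_summary(correct1, correct2, n1, n2, z_crit=1.96):
    """Difference in two proportions with a Wald CI and a two-sided z-test.

    Assumption: the two samples are treated as independent.
    z_crit=1.96 corresponds to an approximate 95% confidence level.
    """
    p1 = correct1 / n1
    p2 = correct2 / n2
    diff = p2 - p1

    # Confidence interval: unpooled standard error of the difference.
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    # Hypothesis test (H0: p1 == p2): pooled standard error.
    p_pool = (correct1 + correct2) / (n1 + n2)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, ci, z, p_value

# 68% and 75% of 5,800 instances answered correctly.
diff, ci, z, p_value = two_proportion_summary(3944, 4350, 5800, 5800)
print(f"difference = {diff:.3f}")
print(f"95% CI     = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"z = {z:.2f}, p-value = {p_value:.3g}")
```

If I understand correctly, the difference would count as statistically significant at the 5% level when the interval excludes 0 (equivalently, when the p-value is below 0.05), but I am not sure this is the right test for my setup.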

Can anyone please help me?