How to calculate statistical significance for the accuracy of two models?

I have two models (`model1`, `model2`) which are run on a test dataset of `5,800` instances.

So I show each data point to each model, and they generate a response.

If the response from the model matches the expected outcome, I count it as “1”; otherwise it is “0”. So it's a `binary` outcome.

• Accuracy from `model1` is 68%.
• Accuracy from `model2` is 75%.

But I was told that the improvement of `model2` over `model1` is not statistically significant.

I was further asked to report a confidence interval from a two-sample test for a difference in proportions.
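As far as I can tell, the textbook version of that request looks like the sketch below. It is only a sketch under assumptions: it treats the two accuracies as independent proportions and uses the normal approximation, even though both models were actually run on the same 5,800 instances. The function name `two_proportion_summary` is made up for illustration.

```python
import math
from statistics import NormalDist

def two_proportion_summary(p1, p2, n1, n2, confidence=0.95):
    """Normal-approximation CI and z-test for p2 - p1.

    Assumes the two samples are independent (an assumption; here both
    models saw the same test instances).
    """
    diff = p2 - p1

    # Unpooled standard error, used for the confidence interval.
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    # Pooled standard error, used for the test of H0: p1 == p2.
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return ci, z, p_value

# Accuracies and test-set size from my setup above.
ci, z, p_value = two_proportion_summary(0.68, 0.75, 5800, 5800)
print(f"95% CI for the difference: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3g}")
```

If the interval excludes 0, the difference would be called significant at that confidence level under these assumptions.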

Also, what is the statistical significance of my result?

I don’t understand this query.

• I ran inference on all the data points; I did not test on only a portion of the test dataset.

As an example, in the YouTube tutorial "Paired Samples T-Test (How to calculate and interpret)", the test is applied to weight loss, so they were able to compute a `mean` and a `standard deviation`. But in my case the outcome is binary, i.e. `0` or `1`.
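For what it's worth, a 0/1 outcome does still have a mean and a standard deviation: the mean is just the accuracy `p`, and the population standard deviation is `sqrt(p * (1 - p))`. A tiny sketch with made-up outcomes (hypothetical data, not my actual results):

```python
import math
from statistics import fmean, pstdev

# Hypothetical per-instance correctness flags (1 = correct, 0 = wrong).
outcomes = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

accuracy = fmean(outcomes)   # mean of 0/1 values is the proportion correct
spread = pstdev(outcomes)    # population standard deviation of the flags

# For binary data the standard deviation is fully determined by the mean.
formula_spread = math.sqrt(accuracy * (1 - accuracy))

print(accuracy, spread, formula_spread)
```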