Do we need to verify the results of transformer models by running cross-validation or statistical tests? If so, how?

I see some papers that run K-fold cross-validation on transformer models like BERT, but I’m not sure how to do it in HF, or whether we need to do it at all.
Say I want to check whether there’s a real difference between the performance metrics of two models, SA and SB: how do I make sure the performance difference is statistically significant and not just due to random chance?
I suppose I could make a global variable that holds the metrics and append the results to it during the compute_metrics() call. But then how do I calculate the CV score? Most applications I’ve seen use sklearn models, which have a direct .score() call on the model, and most guides show something like:

from sklearn.model_selection import cross_validate

_scoring = ['accuracy', 'precision', 'recall', 'f1']
results = cross_validate(estimator=model,
                         X=_X,
                         y=_y,
                         cv=_cv,
                         scoring=_scoring,
                         return_train_score=True)

I don’t think we can do that directly with HF models, though.
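To make my idea concrete, here is a rough sketch of what I had in mind: sklearn’s StratifiedKFold to produce the splits, a fresh Trainer per fold, and the per-fold metrics collected from trainer.evaluate(). The checkpoint, the hyperparameters, and the load_my_data() helper are all just placeholders, so please correct me if this is the wrong way to go about it:

import numpy as np
from datasets import Dataset
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts, labels = load_my_data()   # placeholder: list of strings, list of int labels
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

def compute_metrics(eval_pred):
    logits, y_true = eval_pred
    y_pred = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred, average="macro")}

fold_scores = []   # one dict of eval metrics per fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
    train_ds = Dataset.from_dict({"text": [texts[i] for i in train_idx],
                                  "label": [labels[i] for i in train_idx]}).map(tokenize, batched=True)
    val_ds = Dataset.from_dict({"text": [texts[i] for i in val_idx],
                                "label": [labels[i] for i in val_idx]}).map(tokenize, batched=True)

    # re-initialise the model from the checkpoint each fold so folds stay independent
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    args = TrainingArguments(output_dir=f"cv-fold-{fold}",
                             num_train_epochs=3,
                             per_device_train_batch_size=16,
                             report_to="none")
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=val_ds,
                      compute_metrics=compute_metrics)
    trainer.train()
    fold_scores.append(trainer.evaluate())   # dict with eval_accuracy, eval_f1, ...

accs = [s["eval_accuracy"] for s in fold_scores]
print(f"CV accuracy: {np.mean(accs):.4f} +/- {np.std(accs):.4f}")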

So if we cannot do CV, what other statistical methods should we use to validate our results? ANOVA? Kruskal-Wallis?
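For the SA vs. SB comparison specifically, would something along these lines be reasonable? A paired test on the per-fold scores of the two models, assuming both were evaluated on the same folds (the numbers below are made-up placeholders, not real results):

from scipy import stats

sa_f1 = [0.81, 0.83, 0.80, 0.82, 0.84]   # per-fold F1 of model SA (placeholder values)
sb_f1 = [0.84, 0.85, 0.83, 0.86, 0.85]   # per-fold F1 of model SB (placeholder values)

# paired t-test: folds are matched, so test the per-fold differences
t_stat, p_t = stats.ttest_rel(sa_f1, sb_f1)

# non-parametric alternative if the differences don't look normal
w_stat, p_w = stats.wilcoxon(sa_f1, sb_f1)

print(f"paired t-test p = {p_t:.4f}, Wilcoxon p = {p_w:.4f}")

Or is that the wrong tool here, and ANOVA / Kruskal-Wallis (or something like McNemar’s test on the raw predictions) would be more appropriate?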