Any studies comparing the failures of NLP models vs. schoolchildren on QA or POS tagging?

Some NLP datasets are quite similar to schoolchildren's exercises. Has anybody compared the failures of humans vs. AI on them? This could bring interesting insights into both.