Help Finding Dataset with Increasing Vocab Size and/or Reading Difficulty

Can anyone point me towards a corpus with a natural increase in vocabulary size and/or reading difficulty? I’m thinking of something akin to how children's and young adult books come with estimated grade levels (e.g. “5th grade reading level”).

Have you considered the data from the CommonLit Readability Prize Kaggle competition? I guess it doesn’t exactly fit your description, but it’s similar.
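If you end up with a corpus that has no grade labels, one rough option is to estimate a grade level yourself and sort by it. Here is a minimal sketch using the Flesch-Kincaid grade level formula. The function names are just illustrative, and the syllable counter is a crude vowel-group heuristic (my assumption, not part of any dataset's method), so treat the scores as approximate.

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption): one syllable per group of vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level, computed with the heuristic syllable counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

def sort_by_difficulty(texts):
    """Return texts ordered from lowest to highest estimated grade level."""
    return sorted(texts, key=fk_grade)

easy = "The cat sat. The dog ran."
hard = "Comprehensive institutional evaluations necessitate considerable methodological sophistication."
ordered = sort_by_difficulty([hard, easy])  # easy text comes first
```

Very short or very simple texts can score below zero, so the numbers are more useful for ranking than as literal grade levels.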

Thanks! I’ll take a look 🙂