Help Finding Dataset with Increasing Vocab Size and/or Reading Difficulty

Can anyone point me towards a corpus with a natural increase in vocabulary size and/or reading difficulty? I’m thinking of something akin to how children’s and young adult books come with estimated grade levels (e.g. “5th grade reading level”).

Have you considered the data from the CommonLit Readability Prize Kaggle competition? I guess it doesn’t exactly fit your description, but it’s kind of similar.
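If no ready-made corpus fits, one rough workaround is to score texts yourself and sort them from easiest to hardest. The sketch below uses the Flesch-Kincaid grade-level formula with a crude vowel-group syllable counter. The function names (`count_syllables`, `flesch_kincaid_grade`, `sort_by_difficulty`) are made up for illustration, and the syllable heuristic is an assumption that will miscount some words, so treat the scores as a relative ordering rather than true grade levels.

```python
import re


def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Assumption: a trailing "e" is usually silent ("make"), except in "-le" ("table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Approximate U.S. grade level using the Flesch-Kincaid formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def sort_by_difficulty(texts):
    """Return the texts ordered from lowest to highest estimated grade level."""
    return sorted(texts, key=flesch_kincaid_grade)
```

Calling `sort_by_difficulty(list_of_passages)` on any collection of passages would give an easy-to-hard ordering, which may be close enough to a "natural increase" in difficulty for some experiments.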

Thanks! I’ll take a look :)