Metadata of NLP datasets

Hi,
I’m new to the NLP domain and HuggingFace ecosystem. :slight_smile:
I would like some suggestions on where to read about the metadata of datasets used for NLP.
I have worked mostly with vision data so far, where the simple meta features shared by image datasets in general are:

  • Image resolution
  • No. of training samples
  • No. of classification labels
  • No. of channels

Would the text data used in NLP tasks have similar features in common, aside from the number of training samples and the number of classification labels? Any thoughts are welcome.

Thanks!