Seeking Advice on Stratifying a Multi-label NER Dataset for Balanced Train/Test Split

Hello Everyone,

I am working on a Named Entity Recognition (NER) project with a large dataset tailored for healthcare. The dataset covers around 3,000 distinct disease entity types, each assigned its own pair of integer IDs for the Beginning (B-) and Inside (I-) tags.

Here’s an example of the data format we’re working with:

{
    "ner_tags": [1, 2, 2, 4, 5, 0, 7, 8, 8],
    "tokens": ["Chief", "complaint", ":", "Diabetes", "mellitus", "and", "chronic", "kidney", "disease"]
}

In this example, each token in the sentence is paired with an NER tag that encodes both its entity type and its position within the entity (B or I), with 0 marking tokens outside any entity.
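For splitting purposes, what matters per sentence is the *set* of entity types it contains, which can be recovered from the B-tags alone. A minimal sketch (the tag-ID-to-type mapping here is hypothetical — it assumes tags 1, 4, and 7 in the example above are B-tags; in the real dataset it would come from the tag vocabulary):

```python
# Hypothetical mapping from B-tag ID to an entity-type name (assumption:
# in the example above, tags 1, 4 and 7 are B-tags).
B_TAG_TO_TYPE = {1: "type_0", 4: "type_1", 7: "type_2"}

def entity_types(ner_tags):
    """Return the set of entity types whose B-tag occurs in a sentence."""
    return {B_TAG_TO_TYPE[t] for t in ner_tags if t in B_TAG_TO_TYPE}

example = {
    "ner_tags": [1, 2, 2, 4, 5, 0, 7, 8, 8],
    "tokens": ["Chief", "complaint", ":", "Diabetes", "mellitus",
               "and", "chronic", "kidney", "disease"],
}
print(sorted(entity_types(example["ner_tags"])))
# ['type_0', 'type_1', 'type_2']
```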

The challenge I’m facing is how to properly split this dataset into training, validation, and test sets. With so many distinct entity types, it’s crucial that each set is representative of the full range of diseases — otherwise rare diseases may be missing from one split entirely, and we won’t be able to tell whether the NER model we’re developing generalizes to unseen data.
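One diagnostic I’ve found useful before choosing a strategy is to count, for each entity type, how many *sentences* contain it: any type that occurs in only a handful of sentences cannot be guaranteed a presence in all three splits by plain random sampling. A minimal sketch, assuming each sentence has already been reduced to its set of entity types (the threshold of 10 below is an arbitrary illustration):

```python
from collections import Counter

def entity_sentence_counts(label_sets):
    """For each entity type, count how many sentences contain it at least once."""
    counts = Counter()
    for labels in label_sets:
        counts.update(set(labels))
    return counts

# Toy example: three sentences reduced to their entity-type sets.
counts = entity_sentence_counts([{"t0", "t1"}, {"t1"}, {"t2"}])
# Types seen in very few sentences (threshold is illustrative) are the
# ones most at risk of landing entirely in one split.
rare = [t for t, n in counts.items() if n < 10]
```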

I’m seeking advice on:

  1. Effective strategies for splitting a large, multi-entity NER dataset.
  2. Ensuring that all entities are adequately represented in each subset (training, validation, testing).
  3. Any specific considerations or techniques when dealing with a high number of entity types in NER.
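
For what it’s worth, the direction I’ve been considering — and I’d welcome corrections — is to treat each sentence as a multi-label example (the set of disease types it contains) and apply iterative stratification (Sechidis et al., 2011), available in scikit-multilearn as `iterative_train_test_split`. Here is a simplified, self-contained sketch of the greedy idea in pure Python (function name and tie-breaking rule are my own):

```python
import random
from collections import Counter

def greedy_stratified_split(label_sets, fractions=(0.8, 0.1, 0.1), seed=0):
    """Greedy sketch of iterative stratification: handle entity types from
    rarest to most frequent; give each sentence carrying the current type
    to the split with the greatest remaining demand for it."""
    rng = random.Random(seed)
    n = len(label_sets)
    remaining_total = [f * n for f in fractions]            # per-split capacity
    freq = Counter(l for labels in label_sets for l in labels)
    remaining_label = [{l: c * f for l, c in freq.items()}  # per-split demand
                       for f in fractions]
    assignment = [None] * n
    unassigned = set(range(n))
    for label, _ in sorted(freq.items(), key=lambda kv: kv[1]):
        idxs = [i for i in unassigned if label in label_sets[i]]
        rng.shuffle(idxs)
        for i in idxs:
            # Split with the highest remaining demand for this type;
            # break ties by overall remaining capacity.
            best = max(range(len(fractions)),
                       key=lambda s: (remaining_label[s][label],
                                      remaining_total[s]))
            assignment[i] = best
            unassigned.remove(i)
            remaining_total[best] -= 1
            for l in label_sets[i]:
                remaining_label[best][l] -= 1
    for i in list(unassigned):                              # entity-free sentences
        best = max(range(len(fractions)), key=lambda s: remaining_total[s])
        assignment[i] = best
        remaining_total[best] -= 1
    return assignment
```

This is only a first-order heuristic — scikit-multilearn’s implementation also accounts for label co-occurrence. One more caveat: splitting at the sentence level assumes sentences are independent, so if several sentences come from the same clinical note, the whole note should go to a single split to avoid leakage.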

I’m particularly interested in methodologies that can handle the complexity of our dataset, ensuring a balanced and comprehensive representation of all entities.

Any insights, experiences, or resources you can share would be greatly appreciated. Thank you for your time and help!

Best,
Krishna Reddy