Seeking Advice on Stratifying a Multi-label NER Dataset for Balanced Train/Test Split

Hello Everyone,

I am working on a Named Entity Recognition (NER) project with a large dataset tailored for healthcare. The dataset covers around 3,000 distinct disease entity types, each assigned its own pair of integer IDs for the Beginning (B-) and Inside (I-) tags.

Here’s an example of the data format we’re working with:

{
    "ner_tags": [1, 2, 2, 4, 5, 0, 7, 8, 8],
    "tokens": ["Chief", "complaint", ":", "Diabetes", "mellitus", "and", "chronic", "kidney", "disease"]
}

In this example, each token in the sentence is paired with an NER tag that encodes both its entity type and its position within the entity (B or I), with 0 marking tokens outside any entity.
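For splitting purposes, what matters per sentence is the *set* of entity types it contains, which can be recovered from the B-tags alone. A minimal sketch (the tag-ID-to-type mapping here is hypothetical — it assumes tags 1, 4, and 7 in the example above are B-tags; in the real dataset it would come from the tag vocabulary):

```python
# Hypothetical mapping from B-tag ID to an entity-type name (assumption:
# in the example above, tags 1, 4 and 7 are B-tags).
B_TAG_TO_TYPE = {1: "type_0", 4: "type_1", 7: "type_2"}

def entity_types(ner_tags):
    """Return the set of entity types whose B-tag occurs in a sentence."""
    return {B_TAG_TO_TYPE[t] for t in ner_tags if t in B_TAG_TO_TYPE}

example = {
    "ner_tags": [1, 2, 2, 4, 5, 0, 7, 8, 8],
    "tokens": ["Chief", "complaint", ":", "Diabetes", "mellitus",
               "and", "chronic", "kidney", "disease"],
}
print(sorted(entity_types(example["ner_tags"])))
# ['type_0', 'type_1', 'type_2']
```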

The challenge I’m facing is how to properly split this dataset into training, validation, and test sets. With so many distinct entity types, it’s crucial that each set is representative of the full range of diseases — otherwise rare diseases may be missing from one split entirely, and we won’t be able to tell whether the NER model we’re developing generalizes to unseen data.
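One diagnostic I’ve found useful before choosing a strategy is to count, for each entity type, how many *sentences* contain it: any type that occurs in only a handful of sentences cannot be guaranteed a presence in all three splits by plain random sampling. A minimal sketch, assuming each sentence has already been reduced to its set of entity types (the threshold of 10 below is an arbitrary illustration):

```python
from collections import Counter

def entity_sentence_counts(label_sets):
    """For each entity type, count how many sentences contain it at least once."""
    counts = Counter()
    for labels in label_sets:
        counts.update(set(labels))
    return counts

# Toy example: three sentences reduced to their entity-type sets.
counts = entity_sentence_counts([{"t0", "t1"}, {"t1"}, {"t2"}])
# Types seen in very few sentences (threshold is illustrative) are the
# ones most at risk of landing entirely in one split.
rare = [t for t, n in counts.items() if n < 10]
```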

I’m seeking advice on:

  1. Effective strategies for splitting a large, multi-entity NER dataset.
  2. Ensuring that all entities are adequately represented in each subset (training, validation, testing).
  3. Any specific considerations or techniques when dealing with a high number of entity types in NER.
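
For what it’s worth, the direction I’ve been considering — and I’d welcome corrections — is to treat each sentence as a multi-label example (the set of disease types it contains) and apply iterative stratification (Sechidis et al., 2011), available in scikit-multilearn as `iterative_train_test_split`. Here is a simplified, self-contained sketch of the greedy idea in pure Python (function name and tie-breaking rule are my own):

```python
import random
from collections import Counter

def greedy_stratified_split(label_sets, fractions=(0.8, 0.1, 0.1), seed=0):
    """Greedy sketch of iterative stratification: handle entity types from
    rarest to most frequent; give each sentence carrying the current type
    to the split with the greatest remaining demand for it."""
    rng = random.Random(seed)
    n = len(label_sets)
    remaining_total = [f * n for f in fractions]            # per-split capacity
    freq = Counter(l for labels in label_sets for l in labels)
    remaining_label = [{l: c * f for l, c in freq.items()}  # per-split demand
                       for f in fractions]
    assignment = [None] * n
    unassigned = set(range(n))
    for label, _ in sorted(freq.items(), key=lambda kv: kv[1]):
        idxs = [i for i in unassigned if label in label_sets[i]]
        rng.shuffle(idxs)
        for i in idxs:
            # Split with the highest remaining demand for this type;
            # break ties by overall remaining capacity.
            best = max(range(len(fractions)),
                       key=lambda s: (remaining_label[s][label],
                                      remaining_total[s]))
            assignment[i] = best
            unassigned.remove(i)
            remaining_total[best] -= 1
            for l in label_sets[i]:
                remaining_label[best][l] -= 1
    for i in list(unassigned):                              # entity-free sentences
        best = max(range(len(fractions)), key=lambda s: remaining_total[s])
        assignment[i] = best
        remaining_total[best] -= 1
    return assignment
```

This is only a first-order heuristic — scikit-multilearn’s implementation also accounts for label co-occurrence. One more caveat: splitting at the sentence level assumes sentences are independent, so if several sentences come from the same clinical note, the whole note should go to a single split to avoid leakage.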

I’m particularly interested in methodologies that can handle the complexity of our dataset, ensuring a balanced and comprehensive representation of all entities.

Any insights, experiences, or resources you can share would be greatly appreciated. Thank you for your time and help!

Best,
Krishna Reddy