Hi all,
I am developing an NLP classification model for clinical data with ClinicalBERT. The goal is to reliably classify medical records as indicating that a patient is positive or negative for a specific condition. The model currently performs well, but its only mistakes are on one specific type of record: dual biopsies of a paired organ, usually in the form 'left organ, positive; right organ, negative' or 'left organ, negative; right organ, positive.'
I believe the reasons for this are twofold. First, the dataset is imbalanced to begin with, so positive examples are already rare. Second, only a minority of those positive examples are dual biopsies, and of those, only a fraction are positive in one organ and negative in the other.
My first solution was to augment the data with more of this specific class using either backtranslation or generation; however, my first attempt (using generation) led to a marked drop in performance. Another idea was to separate out the dual biopsy records (which is easy to do), train a second model specifically to classify those, and in the production pipeline simply funnel all dual biopsies to that model. Rough sketches of both ideas are below.
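For backtranslation, I was picturing a standard round trip through another language, e.g. with the public MarianMT checkpoints on Hugging Face (a minimal sketch; the kidney example and everything apart from the model names is made up):

```python
from transformers import MarianMTModel, MarianTokenizer

def load(model_name):
    return MarianTokenizer.from_pretrained(model_name), MarianMTModel.from_pretrained(model_name)

en_de_tok, en_de_model = load("Helsinki-NLP/opus-mt-en-de")
de_en_tok, de_en_model = load("Helsinki-NLP/opus-mt-de-en")

def translate(texts, tokenizer, model):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**batch)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def backtranslate(texts):
    # en -> de -> en round trip: paraphrases the wording, but I would still
    # audit that the organ/laterality/result pairing survives the trip
    return translate(translate(texts, en_de_tok, en_de_model), de_en_tok, de_en_model)

print(backtranslate(["Left kidney biopsy: negative. Right kidney biopsy: positive."]))
```

And the routing idea would look something like this (`detect_dual_biopsy()` and the `.predict()` interface are placeholders for my actual rule and fine-tuned models):

```python
import re

# A cheap rule flags dual-biopsy records, which are funneled to the
# specialist model; everything else goes to the main classifier.
DUAL_PATTERN = re.compile(r"\bleft\b.*\bright\b|\bright\b.*\bleft\b",
                          re.IGNORECASE | re.DOTALL)

def detect_dual_biopsy(record: str) -> bool:
    # my real rule is a bit more involved, but detection genuinely is easy
    return bool(DUAL_PATTERN.search(record))

def classify(record: str, main_model, dual_model) -> str:
    model = dual_model if detect_dual_biopsy(record) else main_model
    return model.predict(record)
```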
However, I am unsure whether the two-model approach is more sustainable than data augmentation. Does anybody have tips for data augmentation in this specific use case? Most of the NLP data augmentation examples I have seen online do not preserve context, which is critical here, since an augmentation that scrambles which organ the 'positive' belongs to would corrupt the label.
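To illustrate what I mean by context: about the only transform I am confident is label-safe for these records is mirroring the laterality, since swapping 'left' and 'right' consistently keeps each organ/result pairing intact (purely illustrative sketch):

```python
import re

# Mirror laterality: 'left organ positive, right organ negative' becomes its
# mirror image without changing the record-level label.
def swap_laterality(text: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = "right" if word.lower() == "left" else "left"
        # preserve the capitalization of the original token
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b(left|right)\b", repl, text, flags=re.IGNORECASE)

print(swap_laterality("Left kidney: positive. Right kidney: negative."))
# -> "Right kidney: positive. Left kidney: negative."
```

That at most doubles the dual-biopsy examples, though, which is why I am looking for other context-aware options.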