Hi all,
I am developing an NLP classification model for clinical data with ClinicalBERT. The goal is to reliably classify medical records as indicating that a patient is positive or negative for a specific condition. The model currently performs well, but its only mistakes are on one specific type of record: dual biopsies of a paired organ, usually in the form 'left organ, positive; right organ, negative' or 'left organ, negative; right organ, positive.'
I believe the reasons for this are twofold. First, the dataset is imbalanced to begin with, so positive examples are already rare. Second, only a minority of those positive examples are dual biopsies, and of those, only a fraction are positive in one organ and negative in the other.
My first solution was to augment the data with more of this specific class using either backtranslation or generation; however, my first attempt (using generation) led to a marked drop in performance. Another idea was to separate out the dual biopsy records (which is easy to do), train a second model specifically to classify those, and in the production pipeline simply funnel all dual biopsies to that model. Rough sketches of both ideas are below.
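For backtranslation, I was picturing a standard round trip through another language, e.g. with the public MarianMT checkpoints on Hugging Face (a minimal sketch; the kidney example and everything apart from the model names is made up):

```python
from transformers import MarianMTModel, MarianTokenizer

def load(model_name):
    return MarianTokenizer.from_pretrained(model_name), MarianMTModel.from_pretrained(model_name)

en_de_tok, en_de_model = load("Helsinki-NLP/opus-mt-en-de")
de_en_tok, de_en_model = load("Helsinki-NLP/opus-mt-de-en")

def translate(texts, tokenizer, model):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**batch)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def backtranslate(texts):
    # en -> de -> en round trip: paraphrases the wording, but I would still
    # audit that the organ/laterality/result pairing survives the trip
    return translate(translate(texts, en_de_tok, en_de_model), de_en_tok, de_en_model)

print(backtranslate(["Left kidney biopsy: negative. Right kidney biopsy: positive."]))
```

And the routing idea would look something like this (`detect_dual_biopsy()` and the `.predict()` interface are placeholders for my actual rule and fine-tuned models):

```python
import re

# A cheap rule flags dual-biopsy records, which are funneled to the
# specialist model; everything else goes to the main classifier.
DUAL_PATTERN = re.compile(r"\bleft\b.*\bright\b|\bright\b.*\bleft\b",
                          re.IGNORECASE | re.DOTALL)

def detect_dual_biopsy(record: str) -> bool:
    # my real rule is a bit more involved, but detection genuinely is easy
    return bool(DUAL_PATTERN.search(record))

def classify(record: str, main_model, dual_model) -> str:
    model = dual_model if detect_dual_biopsy(record) else main_model
    return model.predict(record)
```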
However, I am unsure whether the two-model approach is more sustainable than data augmentation. Does anybody have tips for data augmentation in this specific use case? Most of the NLP data augmentation examples I have seen online do not preserve context, which is critical here, since an augmentation that scrambles which organ the 'positive' belongs to would corrupt the label.
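To illustrate what I mean by context: about the only transform I am confident is label-safe for these records is mirroring the laterality, since swapping 'left' and 'right' consistently keeps each organ/result pairing intact (purely illustrative sketch):

```python
import re

# Mirror laterality: 'left organ positive, right organ negative' becomes its
# mirror image without changing the record-level label.
def swap_laterality(text: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = "right" if word.lower() == "left" else "left"
        # preserve the capitalization of the original token
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b(left|right)\b", repl, text, flags=re.IGNORECASE)

print(swap_laterality("Left kidney: positive. Right kidney: negative."))
# -> "Right kidney: positive. Left kidney: negative."
```

That at most doubles the dual-biopsy examples, though, which is why I am looking for other context-aware options.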