I need to codify medical conditions with diagnostic codes. For example “head injury” may be coded as "S02.0, S02.1 Fracture of skull ". I would like to use a model to find likely diagnosis code candidates for entered text. What is the best approach to solving this task? I can either try to find the closest semantic similarity between input sentence and list of diagnosis or I can try to do multi-label classification where diagnostic code is a class. Any ideas, suggestions? Thanks.
Either of your approaches could work.
Do you have a corpus of documents that contains both medical conditions and codified medical conditions?
Yes, we have 2 data sources: (1) corpus with all notes and (2) list of diagnosis codes with descriptions. We can train embeddings on the corpus and then run embeddings on descriptions of diagnostic codes. My concern is the number of labels (at least 100), not sure how well the classifier can handle this many labels.
So the codes are technically in a different corpus? Then I’d probably try retrieving embeddings before the classifier.
Yes, the plan was to embed code descriptions for either sentence similarity or classification, but which one to try ?!
Or maybe we can use a zero-shot classification pipeline? we can pass sentence and possible labels.
Honestly, you should try both and see which one does better. ML is a very iterative process, so it’s always best to try different things.
Personally, I’d first try similarity.