FDA Label Document Embedding

Hi everyone,
I am looking for any ideas or advice you may have from similar situations.

I have been working on an NLP task to cluster medical documents for some time, and while I am eager to use transformers to get the best results, TF-IDF has worked best despite all my efforts.

I am working with the SIDER side effect dataset, which provides annotated FDA medication labels; an example is here:
http://sideeffects.embl.de/media/pdf/fda/17106s032lbl/annotated.html#C0026961_0

I have tried TF-IDF and SciBERT (via sentence-transformers), selecting the most relevant passages, but no great results yet. Does anyone have any ideas or previous experience?

Many Thanks,
Chris

Hi @FL33TW00D, I ran into a similar problem last year with TF-IDF and found the following approach gave better results:

  1. Encode the documents, either with your favourite Transformer or Universal Sentence Encoder (the latter works really well!)
  2. Run UMAP on the embeddings to perform dimensionality reduction
  3. Cluster with HDBSCAN (see the sketch below)
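
If it helps, here's a rough sketch of that pipeline with sentence-transformers, umap-learn and hdbscan; the model name and parameter values below are just placeholders, not recommendations:

```python
# pip install sentence-transformers umap-learn hdbscan
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

docs = [...]  # your list of FDA label texts, one string per document

# 1. Encode the documents (placeholder model; any sentence-level encoder works here)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs, show_progress_bar=True)

# 2. Reduce dimensionality with UMAP
reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(embeddings)

# 3. Cluster with HDBSCAN (label -1 means "noise")
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(reduced)
```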

HTH!


Hi @lewtun,
Thanks for the response.

How did you manage to encode the entire document? Did you perform summarization or did you split it up into chunks and average?

I’ve already included steps 2 and 3 in my pipeline, so I feel it's the representations that are holding me back! Do you think I should make an attempt to somehow include the annotations provided by the dataset in the representations?

Many Thanks,
Chris

In my case the documents were short emails, most of which fit within the 512-token limit of USE. I did not try anything fancy like summarization or chunking, but chunking would be the first thing I'd try for a long document :)
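
A rough sketch of the chunk-and-average idea (the chunk size and model name are arbitrary placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def embed_long_document(text: str, chunk_size: int = 200) -> np.ndarray:
    """Split a long document into word chunks, embed each chunk, and average."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    chunk_embeddings = encoder.encode(chunks)
    return chunk_embeddings.mean(axis=0)  # one vector per document

doc_vector = embed_long_document("text of a long FDA label " * 500)
```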

Regarding the annotations, they might help, but you’d have to think carefully about how you plan to combine them with the embeddings before applying UMAP.
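
For example, one simple (untested) way to combine them would be to multi-hot encode the annotation terms, scale them, and concatenate them onto the dense embeddings before UMAP. The data structures below are purely hypothetical stand-ins:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-ins: dense document embeddings plus the annotated terms per label
doc_embeddings = np.random.rand(3, 768)
doc_annotations = [["nausea", "headache"], ["rash"], ["nausea"]]

# Multi-hot encode the annotation terms
annotation_matrix = MultiLabelBinarizer().fit_transform(doc_annotations)

# Scale the annotation block so it doesn't swamp the dense embeddings,
# then concatenate before running UMAP
alpha = 0.5  # weighting factor you would need to tune
combined = np.hstack([doc_embeddings, alpha * annotation_matrix])
```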

Perhaps a “quick and dirty” approach would be to experiment with concatenating the hidden states from multiple layers to see if that improves your document representation (assuming you’re currently just taking the last hidden state).
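
Something along these lines, assuming you're using SciBERT through transformers; concatenating the last four layers is just one arbitrary choice:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased", output_hidden_states=True)

inputs = tokenizer("Example passage from an FDA label ...", return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors of shape (1, seq_len, hidden_dim)
hidden_states = outputs.hidden_states

# Concatenate the last four layers along the hidden dimension, then mean-pool over tokens
last_four = torch.cat(hidden_states[-4:], dim=-1)        # (1, seq_len, 4 * hidden_dim)
doc_embedding = last_four.mean(dim=1).squeeze().numpy()  # (4 * hidden_dim,)
```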

Alternatively, you could try composing different UMAP models for different embeddings (see e.g. here for a discussion), but I’ve never tried that so cannot vouch for its utility.
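
If I've read the docs correctly, umap-learn (>= 0.5) lets you compose fitted models with `*` (intersection) and `+` (union), so a sketch might look like the following - again, untested on my side and the inputs are dummy stand-ins:

```python
import numpy as np
import umap

# Hypothetical stand-ins: two different representations of the same documents
text_embeddings = np.random.rand(100, 768)
annotation_features = np.random.rand(100, 50)

# Fit a separate UMAP model on each representation
text_mapper = umap.UMAP(n_components=5, metric="cosine").fit(text_embeddings)
annot_mapper = umap.UMAP(n_components=5, metric="cosine").fit(annotation_features)

# Compose the fitted models: '*' intersects the fuzzy simplicial sets, '+' takes their union
intersection_embedding = (text_mapper * annot_mapper).embedding_
union_embedding = (text_mapper + annot_mapper).embedding_
```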

@lewtun,
This is great, thanks for the insight. I'm really pleased to see that the version of UMAP you linked supports semi-supervised dimensionality reduction, which is perfect!
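
For anyone else reading, my understanding of the semi-supervised API is that you pass partial labels with -1 marking unlabelled points, roughly like this (the data below is dummy):

```python
import numpy as np
import umap

embeddings = np.random.rand(100, 768)  # stand-in for your document embeddings

# Partial labels: -1 marks unlabelled documents, anything else is a known class
y = np.full(100, -1)
y[:20] = np.random.randint(0, 3, size=20)

# umap-learn treats -1 as "unlabelled", giving a semi-supervised reduction
reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings, y=y)
```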

Will attempt the quick and dirty approach and report back.

Many thanks,
Chris


Hi @lewtun,
Wanted to report back: I did a lot of reading, starting with the Universal Sentence Encoder (which I’d foolishly neglected in my previous passes over the literature). It looked like a great starting point, but I was really looking for something like SciBERT with the vocabulary needed to capture some of the more detailed parts of the data.

Landed upon DeCLUTR (git and paper) and it looks like we are onto a winner!
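
In case it's useful to others, this is roughly how I'm embedding the labels with DeCLUTR through transformers; the checkpoint name is the sci-base one from their repo (swap in whichever you use), and mean pooling is my own choice:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "johngiorgi/declutr-sci-base"  # checkpoint name taken from the DeCLUTR repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

texts = ["Example passage from an FDA label ..."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over the non-padding tokens to get one vector per document
mask = inputs["attention_mask"].unsqueeze(-1)
doc_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```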


Many thanks for the input,
Chris


Thanks for the pointer to DeCLUTR - I hadn’t heard of it and it looks like a really interesting and simple approach!

Hi @lewtun,
Sorry to bother you on this again, but I wanted to pick your brain about the optimal distance metric you found for UMAP. In their documentation they use Hellinger, but this doesn’t work for negative values: Document embedding using UMAP — umap 0.5 documentation

I also wondered if you’d found a way to select the optimal dimensionality of the UMAP reduction in order to give HDBSCAN as much information as possible.

Any insight or papers in this area would be much appreciated.

Many thanks,
Chris

Edit: On a second search of their documentation I found a much more helpful entry: Using UMAP for Clustering — umap 0.5 documentation, but would still love to hear your findings.

Hi @FL33TW00D, in my use case (emails) I was able to get good results with cosine similarity and 5 dimensions for the embedding space.

Although not strictly a metric, cosine similarity is nice because it doesn’t care about the length of the documents - if you need a proper metric then you could use the L2-normalised Euclidean distance (Cosine similarity - Wikipedia). I wish I could say that I got the dim=5 value through some deep intuition about topology, but it was mostly trial and error :)
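
Concretely, something like the sketch below (min_cluster_size and the data are placeholders). The normalise-then-Euclidean route works because for unit vectors ||u - v||^2 = 2 * (1 - cos(u, v)), so both options rank neighbours the same way:

```python
import numpy as np
import umap
import hdbscan
from sklearn.preprocessing import normalize

embeddings = np.random.rand(100, 768)  # stand-in for your document embeddings

# Option A: let UMAP use cosine directly
reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

# Option B: L2-normalise first and use Euclidean (a proper metric)
reduced_alt = umap.UMAP(n_components=5, metric="euclidean").fit_transform(normalize(embeddings))

labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(reduced)
```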

The other UMAP parameters were left at their default values, which incidentally are similar to those used in the top2vec paper: https://arxiv.org/pdf/2008.09470.pdf

I’m not aware of a principled way of choosing the optimal embedding dimension - perhaps you could run a simple grid search to see which one works best?
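
One simple (unvalidated) way to set up that grid search would be to score each candidate dimension with HDBSCAN's relative validity, a fast DBCV approximation; the candidate values and data below are just placeholders:

```python
import numpy as np
import umap
import hdbscan

embeddings = np.random.rand(200, 768)  # stand-in for your document embeddings

scores = {}
for n_components in [2, 5, 10, 25, 50]:
    reduced = umap.UMAP(n_components=n_components, metric="cosine").fit_transform(embeddings)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True).fit(reduced)
    # relative_validity_ approximates DBCV; higher is better
    scores[n_components] = clusterer.relative_validity_

best = max(scores, key=scores.get)
print(scores, "best n_components:", best)
```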

Hi @lewtun,
Thanks for getting back to me - this confirms my own preliminary findings, but I will set up a grid search for concrete proof.

Many Thanks,
Chris
