FDA Label Document Embedding

Hi everyone,
I am looking for any ideas or advice you may have from similar situations.

I have been working on an NLP task to cluster medical documents for some time, and while I am eager to use transformers to get the best results, TF-IDF has worked best despite all my efforts.

I am working with the SIDER side effect dataset, which provides annotated FDA medication labels; an example is here:
http://sideeffects.embl.de/media/pdf/fda/17106s032lbl/annotated.html#C0026961_0

I have tried TF-IDF and SciBERT (via sentence-transformers), selecting the most relevant passages, but no great results yet. Does anyone have any ideas or previous experience?

Many Thanks,
Chris

Hi @FL33TW00D, I ran into a similar problem last year with TF-IDF and found the following approach gave better results:

  1. Encode the documents, either with your favourite Transformer or Universal Sentence Encoder (the latter works really well!)
  2. Run UMAP on the embeddings to perform dimensionality reduction
  3. Cluster with HDBSCAN (see the sketch below)
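
If it helps, here's a rough sketch of that pipeline with sentence-transformers, umap-learn and hdbscan; the model name and parameter values below are just placeholders, not recommendations:

```python
# pip install sentence-transformers umap-learn hdbscan
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

docs = [...]  # your list of FDA label texts, one string per document

# 1. Encode the documents (placeholder model; any sentence-level encoder works here)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs, show_progress_bar=True)

# 2. Reduce dimensionality with UMAP
reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(embeddings)

# 3. Cluster with HDBSCAN (label -1 means "noise")
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(reduced)
```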

HTH!


Hi @lewtun,
Thanks for the response.

How did you manage to encode the entire document? Did you perform summarization or did you split it up into chunks and average?

I’ve already included steps 2 and 3 in my pipeline, so I feel it's the representations that are holding me back! Do you think I should make an attempt to somehow include the annotations provided by the dataset in the representations?

Many Thanks,
Chris

In my case the documents were short emails, most of which fit within the 512-token limit of USE. I did not try anything fancy like summarization or chunking, but chunking would be the first thing I'd try for a long document :)
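
A rough sketch of the chunk-and-average idea (the chunk size and model name are arbitrary placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def embed_long_document(text: str, chunk_size: int = 200) -> np.ndarray:
    """Split a long document into word chunks, embed each chunk, and average."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    chunk_embeddings = encoder.encode(chunks)
    return chunk_embeddings.mean(axis=0)  # one vector per document

doc_vector = embed_long_document("text of a long FDA label " * 500)
```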

Regarding the annotations, they might help, but you’d have to think carefully about how you plan to combine them with the embeddings before applying UMAP.
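
For example, one simple (untested) way to combine them would be to multi-hot encode the annotation terms, scale them, and concatenate them onto the dense embeddings before UMAP. The data structures below are purely hypothetical stand-ins:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-ins: dense document embeddings plus the annotated terms per label
doc_embeddings = np.random.rand(3, 768)
doc_annotations = [["nausea", "headache"], ["rash"], ["nausea"]]

# Multi-hot encode the annotation terms
annotation_matrix = MultiLabelBinarizer().fit_transform(doc_annotations)

# Scale the annotation block so it doesn't swamp the dense embeddings,
# then concatenate before running UMAP
alpha = 0.5  # weighting factor you would need to tune
combined = np.hstack([doc_embeddings, alpha * annotation_matrix])
```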

Perhaps a “quick and dirty” approach would be to experiment with concatenating the hidden states from multiple layers to see if that improves your document representation (assuming you’re currently just taking the last hidden state).
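
Something along these lines, assuming you're using SciBERT through transformers; concatenating the last four layers is just one arbitrary choice:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased", output_hidden_states=True)

inputs = tokenizer("Example passage from an FDA label ...", return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors of shape (1, seq_len, hidden_dim)
hidden_states = outputs.hidden_states

# Concatenate the last four layers along the hidden dimension, then mean-pool over tokens
last_four = torch.cat(hidden_states[-4:], dim=-1)        # (1, seq_len, 4 * hidden_dim)
doc_embedding = last_four.mean(dim=1).squeeze().numpy()  # (4 * hidden_dim,)
```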

Alternatively, you could try composing different UMAP models for different embeddings (see e.g. here for a discussion), but I’ve never tried that so cannot vouch for its utility.
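
If I've read the docs correctly, umap-learn (>= 0.5) lets you compose fitted models with `*` (intersection) and `+` (union), so a sketch might look like the following - again, untested on my side and the inputs are dummy stand-ins:

```python
import numpy as np
import umap

# Hypothetical stand-ins: two different representations of the same documents
text_embeddings = np.random.rand(100, 768)
annotation_features = np.random.rand(100, 50)

# Fit a separate UMAP model on each representation
text_mapper = umap.UMAP(n_components=5, metric="cosine").fit(text_embeddings)
annot_mapper = umap.UMAP(n_components=5, metric="cosine").fit(annotation_features)

# Compose the fitted models: '*' intersects the fuzzy simplicial sets, '+' takes their union
intersection_embedding = (text_mapper * annot_mapper).embedding_
union_embedding = (text_mapper + annot_mapper).embedding_
```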

@lewtun,
This is great, thanks for the insight. I'm really pleased to see that the version of UMAP you linked supports semi-supervised dimensionality reduction, which is perfect!
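
For anyone else reading, my understanding of the semi-supervised API is that you pass partial labels with -1 marking unlabelled points, roughly like this (the data below is dummy):

```python
import numpy as np
import umap

embeddings = np.random.rand(100, 768)  # stand-in for your document embeddings

# Partial labels: -1 marks unlabelled documents, anything else is a known class
y = np.full(100, -1)
y[:20] = np.random.randint(0, 3, size=20)

# umap-learn treats -1 as "unlabelled", giving a semi-supervised reduction
reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings, y=y)
```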

Will attempt the quick and dirty approach and report back.

Many thanks,
Chris


Hi @lewtun,
Wanted to report back: I did a lot of reading, starting with the Universal Sentence Encoder (which I’d foolishly neglected in my previous passes over the literature). It looked like a great starting point, but I was really looking for something like SciBERT with the vocabulary needed to capture some of the more detailed parts of the data.

Landed upon DeCLUTR (git and paper) and it looks like we are onto a winner!
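
In case it's useful to others, this is roughly how I'm embedding the labels with DeCLUTR through transformers; the checkpoint name is the sci-base one from their repo (swap in whichever you use), and mean pooling is my own choice:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "johngiorgi/declutr-sci-base"  # checkpoint name taken from the DeCLUTR repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

texts = ["Example passage from an FDA label ..."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over the non-padding tokens to get one vector per document
mask = inputs["attention_mask"].unsqueeze(-1)
doc_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```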


Many thanks for the input,
Chris


Thanks for the pointer to DeCLUTR - I hadn’t heard of it and it looks like a really interesting and simple approach!

Hi @lewtun,
Sorry to bother you on this again, but I wanted to pick your brain about the optimal distance metric you found for UMAP. In their documentation they use Hellinger, but this doesn’t work for negative values: Document embedding using UMAP — umap 0.5 documentation

I also wondered if you’d found a way to select the optimal dimensionality of the UMAP reduction in order to give HDBSCAN as much information as possible.

Any insight or papers in this area would be much appreciated.

Many thanks,
Chris

Edit: On a second search of their documentation I found a much more helpful entry: Using UMAP for Clustering — umap 0.5 documentation, but would still love to hear your findings.

Hi @FL33TW00D, in my use case (emails) I was able to get good results with cosine similarity and 5 dimensions for the embedding space.

Although not strictly a metric, cosine similarity is nice because it doesn’t care about the length of the documents - if you need a proper metric then you could use the L2-normalised Euclidean distance (Cosine similarity - Wikipedia). I wish I could say that I got the dim=5 value through some deep intuition about topology, but it was mostly trial and error :)
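
Concretely, something like the sketch below (min_cluster_size and the data are placeholders). The normalise-then-Euclidean route works because for unit vectors ||u - v||^2 = 2 * (1 - cos(u, v)), so both options rank neighbours the same way:

```python
import numpy as np
import umap
import hdbscan
from sklearn.preprocessing import normalize

embeddings = np.random.rand(100, 768)  # stand-in for your document embeddings

# Option A: let UMAP use cosine directly
reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

# Option B: L2-normalise first and use Euclidean (a proper metric)
reduced_alt = umap.UMAP(n_components=5, metric="euclidean").fit_transform(normalize(embeddings))

labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(reduced)
```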

The other UMAP parameters were left at their default values, which incidentally are similar to those used in the top2vec paper: https://arxiv.org/pdf/2008.09470.pdf

I’m not aware of a principled way of choosing the optimal embedding dimension - perhaps you could run a simple grid search to see which one works best?
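
One simple (unvalidated) way to set up that grid search would be to score each candidate dimension with HDBSCAN's relative validity, a fast DBCV approximation; the candidate values and data below are just placeholders:

```python
import numpy as np
import umap
import hdbscan

embeddings = np.random.rand(200, 768)  # stand-in for your document embeddings

scores = {}
for n_components in [2, 5, 10, 25, 50]:
    reduced = umap.UMAP(n_components=n_components, metric="cosine").fit_transform(embeddings)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True).fit(reduced)
    # relative_validity_ approximates DBCV; higher is better
    scores[n_components] = clusterer.relative_validity_

best = max(scores, key=scores.get)
print(scores, "best n_components:", best)
```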

Hi @lewtun,
Thanks for getting back to me - this confirms my own preliminary findings, but I will set up a grid search for concrete proof.

Many Thanks,
Chris
