Document clustering and summarisation with GraphRAG

JJ87 · July 23, 2024, 9:36am

Suppose I have a corpus of documents that I want to cluster and summarise. The number of parent clusters is not known in advance, and each parent may in turn contain several child clusters. I would like to identify both parent and child clusters, and generate LLM summaries for each.

My approach has been to run hierarchical agglomerative clustering on the document embeddings, choosing the number of parent clusters that maximises the silhouette score. I then repeat this process within each parent to determine its number of child clusters. Following this, for each cluster I extract several documents whose embeddings are closest to the cluster centroid, along with important keywords and key phrases from the cluster, to use in the LLM summarisation prompt.
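
For concreteness, here is a minimal sketch of the procedure described above. It assumes the embeddings are already available as a NumPy array `X` with one row per document, and it uses scikit-learn's default Ward linkage and Euclidean silhouette, which may not match the metric actually used. The function names, the candidate range for the number of clusters, and the `min_child_size` cut-off are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def best_agglomerative(X, k_min=2, k_max=10):
    """Return (labels, k, score) for the k in [k_min, k_max] with the highest silhouette."""
    best = None
    upper = min(k_max, len(X) - 1)  # silhouette needs 2 <= k <= n_samples - 1
    for k in range(k_min, upper + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[2]:
            best = (labels, k, score)
    if best is None:
        raise ValueError("need at least 3 documents to compare cluster counts")
    return best


def nearest_to_centroid(X, labels, n_docs=3):
    """Map each cluster label to the indices of the n_docs rows closest to its centroid."""
    out = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        centroid = X[idx].mean(axis=0)
        dist = np.linalg.norm(X[idx] - centroid, axis=1)
        out[int(c)] = idx[np.argsort(dist)[:n_docs]].tolist()
    return out


def two_level_clusters(X, min_child_size=4):
    """Cluster into parents, then re-cluster each sufficiently large parent into children."""
    parent_labels, _, _ = best_agglomerative(X)
    children = {}
    for p in np.unique(parent_labels):
        idx = np.flatnonzero(parent_labels == p)
        if len(idx) < min_child_size:  # hypothetical cut-off: too small to split
            continue
        sub_labels, _, _ = best_agglomerative(X[idx])
        children[int(p)] = (idx, sub_labels)  # idx maps child rows back to the corpus
    return parent_labels, children
```

One limitation of this sketch is that the silhouette score is undefined for a single cluster, so every parent that passes the size cut-off is split into at least two children, even if it has no real substructure.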

This isn’t exactly a conventional RAG application, since I seek to summarise the entire corpus, but I think it shares enough similarities to be considered a kind of RAG.

I would be grateful for recommendations on how to improve this procedure. For example, I’m aware that GraphRAG can use community detection to identify clustered concepts. Would GraphRAG perhaps be more suitable for identifying parent and child clusters than my current approach? If so, would the LLM prompt take a different form than the key document, key phrase and keyword extraction that I’ve outlined?
