Document clustering and summarisation with GraphRAG

Suppose I have a corpus of documents that I want to cluster and summarise. The number of parent clusters is unknown, and each parent may in turn contain several child clusters. I would like to identify both parent and child clusters and generate an LLM summary for each.

My approach has been to use hierarchical agglomerative clustering on the document embeddings, choosing the number of parent clusters that maximises the silhouette score and then clustering with that number. I then repeat the process within each parent to determine its child clusters. Finally, for each cluster I extract the few documents whose embeddings are closest to the cluster centroid, along with important keywords and key phrases from the cluster, to use in the LLM summarisation prompt.
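
For concreteness, here is a minimal sketch of the procedure described above, assuming the embeddings are already available as a NumPy array `X` with one row per document. The use of scikit-learn, the function names, the default Ward linkage, the search range of 2 to 10 clusters and the `min_child_size` cut-off are all illustrative assumptions rather than details of my actual pipeline, and keyword extraction is left out.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def best_k_labels(X, k_min=2, k_max=10):
    """Return (k, labels) for the k in k_min..k_max with the highest silhouette score."""
    best_k, best_score, best_labels = None, -np.inf, None
    # silhouette_score is only defined for 2 <= k <= n_samples - 1
    for k in range(k_min, min(k_max, len(X) - 1) + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


def central_docs(X, labels, cluster_id, n_docs=3):
    """Indices of the n_docs cluster members closest to the cluster centroid."""
    members = np.flatnonzero(labels == cluster_id)
    centroid = X[members].mean(axis=0)
    dists = np.linalg.norm(X[members] - centroid, axis=1)
    return members[np.argsort(dists)[:n_docs]]


def two_level_clusters(X, min_child_size=4):
    """Cluster into parents, then re-cluster each sufficiently large parent."""
    _, parent_labels = best_k_labels(X)
    children = {}
    for p in np.unique(parent_labels):
        idx = np.flatnonzero(parent_labels == p)
        if len(idx) < min_child_size:
            continue  # too small to split meaningfully (assumed threshold)
        _, child_labels = best_k_labels(X[idx])
        # idx maps positions in child_labels back to rows of X
        children[int(p)] = (idx, child_labels)
    return parent_labels, children
```

One limitation this makes visible: the silhouette score is undefined for a single cluster, so every parent that is re-clustered is split into at least two children, even when it has no real sub-structure.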

This isn’t exactly a conventional RAG application, since I seek to summarise the entire corpus, but I think it shares enough similarities to be considered a kind of RAG.

I would be grateful for recommendations on how to improve this procedure. For example, I’m aware that GraphRAG can use community detection to identify clustered concepts. Would GraphRAG perhaps be more suitable for identifying parent and child clusters than my current approach? If so, would the LLM prompt take a different form than the key document, key phrase and keyword extraction that I’ve outlined?
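
To make the community-detection alternative concrete, this is the kind of thing I have in mind, as a rough sketch only. It is not GraphRAG itself: as I understand it, GraphRAG runs hierarchical Leiden on an entity graph extracted by an LLM, whereas this sketch swaps in Louvain (via networkx) on a k-nearest-neighbour cosine-similarity graph built directly from the document embeddings. The function names, `k=5` and the fixed seed are illustrative assumptions, and the embeddings are assumed to be non-zero vectors.

```python
import networkx as nx
import numpy as np


def knn_graph(X, k=5):
    """Weighted graph linking each document to its k most cosine-similar peers."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    for i in range(len(X)):
        for j in np.argsort(sim[i])[-k:]:
            if sim[i, j] > 0:  # keep only positive edge weights
                G.add_edge(i, int(j), weight=float(sim[i, j]))
    return G


def parent_child_communities(G, seed=0):
    """Louvain on the full graph for parents, then on each parent's subgraph."""
    parents = nx.community.louvain_communities(G, weight="weight", seed=seed)
    hierarchy = []
    for parent in parents:
        sub = G.subgraph(parent)
        if sub.number_of_edges() == 0:
            children = [set(parent)]  # nothing to split
        else:
            children = nx.community.louvain_communities(
                sub, weight="weight", seed=seed
            )
        hierarchy.append((set(parent), children))
    return hierarchy
```

Unlike the silhouette-based approach, the number of parents and children here falls out of modularity optimisation rather than an explicit search over k, and a parent can legitimately come back with a single child.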