Fine-Tuning LLMs for Complex Document-Based Dataset Creation Tasks

Heads-up: this post is long, but please consider reading it through.

If you’re not reading this, what are you even doing? :eyes: :open_book:

Seeking Advice: Fine-Tuning LLMs for Complex Document-Based QA Tasks

I’m working on fine-tuning a language model using a dataset derived from unstructured documents (e.g., technical guides or regulatory manuals). My current approach is to extract paragraphs from the source PDFs and prompt an LLM to generate multiple Q&A pairs per section (a minimal sketch of this pipeline is included below). While this method is scalable and has helped me build a sizable dataset (500,000+ Q&A pairs), it has a major limitation:

The generated answers are constrained to the context of the input section.

This becomes problematic when the correct answer to a question requires referencing multiple sections or tables across the document. For example, answering a question about the conditional usage of a specific variable might require synthesizing information from several chapters, appendices, and rule tables. The model, trained on isolated Q&A pairs, struggles to generalize or reason across sections.
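For concreteness, here is a minimal sketch of the per-section generation pipeline described above, assuming pypdf for text extraction and the OpenAI chat API for generation; the model name, prompt wording, and file name are placeholders, not my exact setup.

```python
# Minimal sketch of per-section Q&A generation (assumptions: pypdf, OpenAI API).
import json
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()

def extract_sections(pdf_path: str, min_chars: int = 500) -> list[str]:
    """Split the PDF into rough 'sections' (here, simply non-trivial pages)."""
    reader = PdfReader(pdf_path)
    sections = []
    for page in reader.pages:
        text = page.extract_text() or ""
        if len(text) >= min_chars:
            sections.append(text)
    return sections

def generate_qa_pairs(section_text: str, n_pairs: int = 5) -> list[dict]:
    """Ask the LLM for n_pairs question/answer pairs grounded in one section."""
    prompt = (
        f"Generate {n_pairs} question-answer pairs that are strictly answerable "
        "from the text below. Return a JSON array of objects with keys "
        "'question' and 'answer'.\n\n" + section_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        return []  # skip sections where the model did not return clean JSON

dataset = []
for section in extract_sections("compliance_manual.pdf"):  # hypothetical file
    dataset.extend(generate_qa_pairs(section))
```

Because each call only ever sees one section, every generated answer is bounded by that section, which is exactly the limitation above.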

Context: the PDF contains rules and regulations that must be followed when creating a dataset for a public use case; basically, think of it as a compliance-related module.

My Goal
To fine-tune a model that can:

1. Understand and answer general questions about the document in the context of rule violations, validation, and compliance, as described in the training PDF.
2. Reference and synthesize information from multiple parts of the document, because in practice a human expert might pull from 5 tables and 10 different pages to curate a single answer.
3. Provide accurate, contextually rich responses, similar to how a human expert would.

Challenges

1. Context window limitations during training data generation.
2. Lack of cross-sectional reasoning in the training samples.
3. Risk of overfitting to shallow Q&A patterns.

What I’m Exploring

1. Are there alternative data creation strategies, beyond Q&A pairs, that better capture cross-sectional dependencies?
2. Should I include multi-hop QA generation or document-level summarization in the dataset? Even with multi-hop generation or summarization, the odds are slim that the LLM is actually fed the most relevant chunks together, which is what would make the resulting Q&A pairs genuinely more meaningful. A rough sketch of one way to group related chunks before generation follows this list.
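Here is a rough sketch of one way to build multi-hop training samples: group each section with its most similar sections (embedding similarity as a cheap proxy for true cross-references) and prompt over the combined context. The embedding model and the choice of k are assumptions, and it reuses extract_sections() and generate_qa_pairs() from the earlier sketch.

```python
# Rough sketch: group semantically related sections into joint contexts
# (assumption: sentence-transformers for embeddings; k kept small so the
# combined context still fits in the generation model's context window).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def build_multi_hop_contexts(sections: list[str], k: int = 3) -> list[str]:
    """Join each section with its k most similar sections into one combined context."""
    emb = embedder.encode(sections, convert_to_tensor=True)
    sims = util.cos_sim(emb, emb)  # pairwise cosine-similarity matrix
    contexts = []
    for i in range(len(sections)):
        # The highest-similarity entry is the section itself, so skip index 0.
        neighbors = sims[i].argsort(descending=True)[1:k + 1].tolist()
        combined = [sections[i]] + [sections[j] for j in neighbors]
        contexts.append("\n\n---\n\n".join(combined))
    return contexts

# Reuses extract_sections() / generate_qa_pairs() from the earlier sketch; for
# multi-hop data, the prompt there would also need rewording to require answers
# that synthesize information from more than one of the joined parts.
multi_hop_dataset = []
for context in build_multi_hop_contexts(extract_sections("compliance_manual.pdf")):
    multi_hop_dataset.extend(generate_qa_pairs(context, n_pairs=3))
```

Embedding similarity is only a heuristic; explicit cross-references (e.g., "see Appendix B" or shared rule/table identifiers) would group the truly relevant chunks more reliably when they can be parsed out.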

P.S.: I know RAG (Retrieval-Augmented Generation) would be a better fit for this use case, but I’m exploring the problem described above anyway!

I’d love to hear from others who’ve tackled similar problems. How did you structure your dataset to enable deep document understanding? Any tips or frameworks you recommend?


Resources for now

@John6666 can we connect on Discord? My handle is manthan4688.


Fine, but I can’t find it… BTW, I’m john6666cat on Discord.