LoRA Finetuning

Hello, I am currently trying to advance the topic of “talk with documents” at my company. Can you tell me whether LoRA fine-tuning is suitable for getting information from PDF files into my language model so that I can then query the model about those documents?

1 Like

It would be possible to fine-tune the LLM or VLM itself to handle PDFs directly, the way ChatGPT or Gemini do, but that would probably be quite expensive.
If the visual layout of the PDF matters, there may be no way around it; if only the text matters, it is cheaper to extract the text with an ordinary program first.
Some frameworks have built-in PDF handling, and there are also several PDF-conversion libraries for Python; one is sketched below.
Also, to get an idea of what a finished product looks like, you can browse HF Spaces. The source code of each Space is visible too.
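
For instance, plain-text extraction with pypdf (one such library; the filename is just a placeholder) can be as short as this:

```python
from pypdf import PdfReader  # pip install pypdf

reader = PdfReader("report.pdf")  # placeholder path
# Concatenate the extracted text of every page
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])  # preview the first 500 characters
```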

1 Like

Hello! It’s great to hear that you’re advancing the topic of “talk with documents” in your company. To address your question about LoRA fine-tuning for this purpose:

LoRA (Low-Rank Adaptation) fine-tuning is primarily used to adapt large language models (LLMs) efficiently by training a small set of additional parameters while keeping the base model’s weights frozen. It is a cost-effective way to specialize a model for a particular task or domain without requiring full model fine-tuning. However, for querying information specifically from PDFs, LoRA alone might not be the best standalone solution.

Here’s why:

  1. PDF Parsing and Preprocessing: Extracting structured data from PDFs requires robust preprocessing pipelines to handle text, tables, images, and metadata. Tools like pypdf (the successor to PyPDF2), Apache Tika, or LangChain’s document loaders can help with this; see the loader sketch after this list.

  2. Information Retrieval (IR) vs. Fine-Tuning: Instead of directly fine-tuning the model with PDF content, you can use Retrieval-Augmented Generation (RAG). This involves embedding the document content into a vector database (e.g., FAISS, Weaviate, or Pinecone) and using similarity search to retrieve relevant passages at query time. This approach avoids the need to modify the model itself.

  3. When to Use LoRA: If your use case requires the model to deeply understand domain-specific concepts or language, LoRA fine-tuning can complement RAG by aligning the model’s responses to your domain. However, it won’t inherently solve the problem of querying PDF data without an IR layer; a peft sketch follows after this list.
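
For the parsing step (point 1), here’s a minimal sketch with LangChain’s PDF loader. It assumes `langchain-community` and `pypdf` are installed, and the file path is a placeholder:

```python
from langchain_community.document_loaders import PyPDFLoader

# Each PDF page becomes one Document with its text and metadata
loader = PyPDFLoader("report.pdf")  # placeholder path
docs = loader.load()
print(docs[0].page_content[:300])  # preview the first page's text
```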
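And for point 3, here’s a minimal sketch of attaching LoRA adapters with the peft library. The model name and hyperparameters are illustrative placeholders, not recommendations; check the module names of whatever base model you actually use:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; use one you have access to
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```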

A recommended pipeline:

  1. Extract and preprocess PDF data into plain text and/or embeddings.
  2. Store embeddings in a vector database for efficient retrieval.
  3. Use your LLM (optionally fine-tuned with LoRA) to process retrieved passages and generate responses.

Frameworks like LangChain, Haystack, or LlamaIndex (formerly GPT Index) are excellent for integrating these steps seamlessly.
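
To make those three steps concrete, here’s a minimal end-to-end sketch with LangChain and FAISS. It assumes `pip install langchain-community langchain-text-splitters faiss-cpu sentence-transformers pypdf`; the file path, query, and embedding model are placeholders:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Extract the PDF text and split it into chunks
docs = PyPDFLoader("report.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# 2. Embed the chunks and store them in a vector database
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
store = FAISS.from_documents(chunks, embeddings)

# 3. Retrieve the most relevant passages and hand them to your LLM
#    (optionally LoRA-fine-tuned) as context for the answer
hits = store.similarity_search("What does the report conclude?", k=3)
context = "\n\n".join(hit.page_content for hit in hits)
# prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```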

Hope this helps! :blush:

3 Likes

RAG is the way to go! The document is embedded and stored in a vector DB, which you then query (“talk to the document”) via a vector similarity search with scores. The results are the passages of the document whose vectors are closest to the vector of the search query.
Check out LangChain and langchain_chroma (the Chroma vector database); a sketch follows below.
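
Here’s a minimal sketch of that scored search with langchain_chroma, assuming `pip install langchain-chroma langchain-community sentence-transformers`; the texts and query are just illustrations:

```python
from langchain_chroma import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
store = Chroma(collection_name="docs", embedding_function=embeddings)
store.add_texts([
    "The 2024 budget increased by 12%.",  # placeholder document content
    "The office is closed on Fridays.",
])

# Chroma returns a distance, so a lower score means closer vectors
for doc, score in store.similarity_search_with_score("What happened to the budget?", k=2):
    print(round(score, 3), doc.page_content)
```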

1 Like