You first need to run OCR on the PDF to convert it to plain text. In addition, since very long inputs can be challenging for LLMs, the text in the following example is shortened before summarization. The example uses docTR, but Tesseract is also a good option; there are many OCR models available, so pick the one that best fits your use case.
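If you would rather use Tesseract, a minimal sketch could look like the following (this assumes the pytesseract and pdf2image packages plus the system tesseract and poppler binaries are available; "scanned.pdf" is a placeholder path, and the snippet is separate from the docTR example below).
# Hypothetical Tesseract-based alternative (not part of the main example)
# pip install pytesseract pdf2image  (tesseract and poppler must also be installed on the system)
import pytesseract
from pdf2image import convert_from_path
pages = convert_from_path("scanned.pdf", dpi=300)  # render each PDF page as a PIL image
full_text = "\n".join(pytesseract.image_to_string(p) for p in pages)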
# pip install -U "python-doctr[torch]" sentence-transformers "transformers>=4.50" accelerate "huggingface_hub[hf_xet]" requests "numpy<2"
import io, re, requests, numpy as np
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
PDF_URL = "https://nlsblog.org/wp-content/uploads/2020/06/image-based-pdf-sample.pdf"
EMB_ID = "sentence-transformers/all-MiniLM-L6-v2"
LLM_ID = "unsloth/gemma-3-270m-it-qat-bnb-4bit" # or any LLM like "Qwen/Qwen2.5-0.5B-Instruct", etc.
# 1) OCR the PDF (docTR)
pdf = requests.get(PDF_URL, timeout=60); pdf.raise_for_status()
doc = DocumentFile.from_pdf(pdf.content)  # from_pdf accepts raw bytes
ocr = ocr_predictor(pretrained=True).to(DEVICE)
res = ocr(doc).export()
text_lines = []
for p in res["pages"]:
    for b in p.get("blocks", []):
        for ln in b.get("lines", []):
            w = [x["value"] for x in ln.get("words", [])]
            if w: text_lines.append(" ".join(w))
full_text = "\n".join(text_lines).strip()
if not full_text:
    raise RuntimeError("OCR produced empty text")
# 2) Embedding-based pre-shrink (top sentences by cosine to document centroid)
sents = [s for s in re.split(r"(?<=[.!?])\s+", full_text) if s.strip()]
emb = SentenceTransformer(EMB_ID).to(DEVICE)
E = emb.encode(sents, convert_to_tensor=True, normalize_embeddings=True)
centroid = E.mean(dim=0, keepdim=True)
scores = util.cos_sim(centroid, E).cpu().numpy().ravel()
order = np.argsort(-scores)
tok = AutoTokenizer.from_pretrained(LLM_ID, use_fast=True)
max_ctx = getattr(tok, "model_max_length", 128_000)
if max_ctx > 1_000_000:  # some tokenizers report a huge sentinel value instead of the real context size
    max_ctx = 32_768  # Gemma 3 270M supports a 32K context window
reserve = 512
selected, tok_count = [], 0
for i in order:
    ids = tok(sents[i], add_special_tokens=False).input_ids
    if tok_count + len(ids) > max_ctx - reserve:
        break
    selected.append(sents[i]); tok_count += len(ids)
reduced_text = " ".join(selected)
# 3) Summarize once with Gemma-3 270M IT
model = AutoModelForCausalLM.from_pretrained(LLM_ID, device_map="auto")
gen = pipeline("text-generation", model=model, tokenizer=tok)  # model is already placed via device_map
msgs = [
    {"role": "system", "content": "Summarize long documents accurately and concisely."},
    {"role": "user", "content": f"Summarize the following text:\n\n{reduced_text}"},
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
out = gen(prompt, max_new_tokens=512, do_sample=False, return_full_text=False)[0]["generated_text"]
#print("Full text:", full_text)
#print("Reduced text:", reduced_text)
print("Summary:", out.strip())
# Example output (wording will vary by model and run):
# When using image-based PDFs, such as those created by scanning or photographing paper, it's important to determine their format so you can understand how to interpret the content. If the file appears in an image-based format, it might contain searchable text. However, without this information, it's challenging to fully understand the document. For instance, if someone provides an image-based PDF with no searchable text, you should inquire about the format of the original file and whether it has been converted into a digital version. This approach helps ensure accurate interpretation and understanding of the document contents.
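To see the embedding-based pre-shrink from step 2 in isolation, here is a minimal sketch that ranks a few made-up sentences by cosine similarity to their centroid using the same MiniLM model; the sentences are purely illustrative.
# Standalone sketch of the centroid-ranking idea from step 2, on made-up sentences
from sentence_transformers import SentenceTransformer, util
demo = [
    "The report describes quarterly revenue growth.",
    "Lunch was served at noon.",
    "Revenue increased thanks to higher subscription renewals.",
]
m = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
D = m.encode(demo, convert_to_tensor=True, normalize_embeddings=True)
sims = util.cos_sim(D.mean(dim=0, keepdim=True), D).ravel()
for s, sc in sorted(zip(demo, sims.tolist()), key=lambda t: -t[1]):
    print(f"{sc:.3f}  {s}")  # the two revenue sentences should rank above the off-topic one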