Models for reading schematic PDFs

Hello All!

I am working with some very complex PDF files that contain CAD drawings and multi-directional text on the same pages, like the attached pic.

I have tried a few Python scripts with BERT and Layout and had no luck… Does anyone have any suggestions on how I can get this information out of this crap and into a model, so we can learn from it and then automate pieces in Excel and such?

Thanks much!

Among existing software, this one might be suitable for relatively messy PDFs.
But even this might not be enough…
https://pypi.org/project/pymupdf4llm/
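
If you want to give it a quick try, here is a minimal sketch. It assumes the package is installed with pip install pymupdf4llm and uses its to_markdown function; the pdf_to_markdown wrapper, its out_path and converter parameters, and the file names are just names I made up for illustration. I have not tested it on CAD-heavy pages, so treat the output as a first look rather than a finished extraction.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path, out_path=None, converter=None):
    """Convert a PDF to Markdown text and optionally save it to a file.

    `converter` is a hypothetical hook: any callable that takes a path
    string and returns text. By default it is pymupdf4llm.to_markdown.
    """
    if converter is None:
        # Third-party dependency, imported lazily: pip install pymupdf4llm
        import pymupdf4llm
        converter = pymupdf4llm.to_markdown

    markdown = converter(str(pdf_path))

    # Save the result so it can be inspected or fed to a later step.
    if out_path is not None:
        Path(out_path).write_text(markdown, encoding="utf-8")
    return markdown


# Example call (placeholder file names):
# text = pdf_to_markdown("your_file.pdf", "your_file.md")
# print(text[:500])
```

Looking at the first few hundred characters of the result is usually enough to tell whether the tool copes with your layout at all.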

How about using OCR?
You can get text using simple code.

pip install pytesseract pillow pdf2image

And then run this code (pytesseract also needs the Tesseract engine installed, and pdf2image needs Poppler):

from pdf2image import convert_from_path
import pytesseract

# Path to the PDF
pdf_path = "your_file.pdf"

# Convert PDF pages to a list of PIL images, one per page
images = convert_from_path(pdf_path)

# Perform OCR on each page
for i, image in enumerate(images):
    # Extract text from the page image
    text = pytesseract.image_to_string(image)
    print(f"Text from page {i + 1}:\n{text}\n")
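
Since your pages have text running in several directions, one thing you could try is running the OCR once per rotation of each page. This is only a rough sketch under my own assumptions: ocr_all_orientations, the angles tuple, and the ocr parameter are names I made up, and it relies on PIL's Image.rotate with expand=True and on pytesseract.image_to_string. Each pass will also return junk for the text that is not upright in that rotation, so you would still need to filter the results yourself.

```python
def ocr_all_orientations(image, angles=(0, 90, 180, 270), ocr=None):
    """Run OCR on an image at several rotations.

    Returns a dict mapping each rotation angle to the text found when
    the page is rotated by that angle (counter-clockwise, as PIL does).
    `ocr` is a hypothetical hook: any callable taking an image and
    returning a string. By default it is pytesseract.image_to_string.
    """
    if ocr is None:
        # Third-party dependency, imported lazily: pip install pytesseract
        import pytesseract
        ocr = pytesseract.image_to_string

    results = {}
    for angle in angles:
        # expand=True grows the canvas so rotated pages are not cropped.
        rotated = image if angle == 0 else image.rotate(angle, expand=True)
        results[angle] = ocr(rotated).strip()
    return results


# Example call with the page images from the code above:
# for i, image in enumerate(images):
#     for angle, text in ocr_all_orientations(image).items():
#         print(f"Page {i + 1}, rotated {angle} degrees:\n{text}\n")
```

Keeping the text per angle, instead of merging it, makes it easier to see which rotation picked up which labels.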

Hope this helps!
