Hello All!
I am working with some very complex PDF files that contain CAD drawings and multi-directional text on the same pages, like the attached pic.
I have tried a few Python scripts with BERT and Layout and had no luck… Does anyone have suggestions on how I can get this information out of this crap and into a model, so we can learn from it and then automate pieces in Excel and such?
Thanks much!
I'm not sure existing software can cope with PDFs this messy, but the library linked below might be worth a try.
Even this might not be enough, though…
https://pypi.org/project/pymupdf4llm/
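In case it helps, here is a minimal sketch of how that library is typically called. It relies on `pymupdf4llm.to_markdown`, which takes a PDF path and returns Markdown text. The function name `pdf_to_markdown_file`, the `to_markdown` parameter, and the file paths are my own placeholders, not part of the library, and I have not tried this on CAD-heavy pages, so treat the result as a starting point.

```python
from pathlib import Path


def pdf_to_markdown_file(pdf_path, out_path, to_markdown=None):
    """Convert a PDF to Markdown text, write it to out_path, and return it.

    `to_markdown` is a hypothetical hook: any callable that maps a PDF path
    to a string. It defaults to pymupdf4llm.to_markdown.
    """
    if to_markdown is None:
        # Imported lazily; requires `pip install pymupdf4llm`.
        import pymupdf4llm

        to_markdown = pymupdf4llm.to_markdown

    md_text = to_markdown(pdf_path)
    Path(out_path).write_text(md_text, encoding="utf-8")
    return md_text


# Usage (placeholder paths):
# text = pdf_to_markdown_file("your_file.pdf", "your_file.md")
# print(text[:500])
```

Writing the Markdown to a file makes it easy to eyeball how much of the drawing text survived before feeding anything to a model.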
How about using OCR?
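You can get the text with some simple code. First install the packages (pdf2image also needs Poppler, and pytesseract needs the Tesseract program itself installed on your system):
pip install pytesseract pillow pdf2image
And then run this code: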
from pdf2image import convert_from_path
import pytesseract

# Path to the PDF
pdf_path = "your_file.pdf"

# Convert PDF pages to images (one PIL image per page)
images = convert_from_path(pdf_path)

# Perform OCR on each page
for i, image in enumerate(images):
    # Extract text from the image
    text = pytesseract.image_to_string(image)
    print(f"Text from page {i + 1}:\n{text}\n")
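Since the pages have text running in several directions, one thing that might help is to OCR each page at 0, 90, 180 and 270 degrees and compare the results. The sketch below does that with `pytesseract.image_to_data`, which reports a confidence per word. The function names `tesseract_score` and `ocr_all_rotations` and the `score` parameter are made up for this example, and the approach assumes the text only runs at right angles. It is a rough idea, not a tested solution for CAD drawings.

```python
from statistics import mean

ANGLES = (0, 90, 180, 270)


def tesseract_score(image):
    """Return (text, mean word confidence) for one PIL image via Tesseract."""
    # Imported lazily; requires `pip install pytesseract` plus the Tesseract binary.
    import pytesseract

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words, confs = [], []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # conf can be a str, int or float depending on version
        if word.strip() and conf >= 0:  # -1 marks non-word layout boxes
            words.append(word)
            confs.append(conf)
    return " ".join(words), (mean(confs) if confs else 0.0)


def ocr_all_rotations(image, score=tesseract_score, angles=ANGLES):
    """OCR the image at each angle; return results sorted best-confidence first."""
    results = []
    for angle in angles:
        # expand=True grows the canvas so a rotated page is not cropped
        rotated = image.rotate(angle, expand=True)
        text, conf = score(rotated)
        results.append({"angle": angle, "text": text, "confidence": conf})
    return sorted(results, key=lambda r: r["confidence"], reverse=True)


# Usage with the page images from the snippet above:
# for page in images:
#     for result in ocr_all_rotations(page):
#         print(result["angle"], round(result["confidence"], 1), result["text"][:80])
```

On a page with mixed directions you would probably keep the text from all four passes rather than only the best one, since each pass should pick up the words that are upright at that angle.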
Hope this helps!